ML · Transformers · Attention · Difficulty: medium · ~15 min

Scaled Dot-Product Attention

Problem

Explain why the dot products in attention are divided by the square root of the key dimension, and what goes wrong during training if you skip this scaling.
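Before attempting, it can help to see the effect numerically. The NumPy sketch below is purely illustrative (it is not the site's reference solution, and all variable names are my own): it samples queries and keys with unit-variance entries, shows that unscaled dot products have variance close to the key dimension d_k, and shows that a softmax over unscaled logits is sharper than one over scaled logits, which is where training problems begin.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # numerically stable softmax along the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_k = 256   # key dimension (illustrative choice)
n = 10_000  # number of sampled query/key pairs

q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

# Each dot product is a sum of d_k terms with variance 1,
# so its variance grows linearly with d_k.
raw = (q * k).sum(axis=1)
scaled = raw / np.sqrt(d_k)  # dividing by sqrt(d_k) restores unit variance

print(f"var(raw)    = {raw.var():.1f}  (close to d_k = {d_k})")
print(f"var(scaled) = {scaled.var():.2f} (close to 1)")

# One query attending over m keys: unscaled logits have std ~ sqrt(d_k),
# so the softmax saturates toward one-hot. Its gradient involves p*(1-p),
# which nearly vanishes when p is pinned near 0 or 1.
m = 8
q1 = rng.standard_normal(d_k)
K = rng.standard_normal((m, d_k))
logits = K @ q1
p_raw = softmax(logits)
p_scaled = softmax(logits / np.sqrt(d_k))
print(f"max attention weight, unscaled: {p_raw.max():.3f}")
print(f"max attention weight, scaled:   {p_scaled.max():.3f}")
```

Dividing the logits by sqrt(d_k) acts like a temperature that keeps the attention distribution diffuse regardless of d_k, which is exactly the behavior the question asks you to explain.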

Reference solution

The reference solution becomes available after you attempt the question.

Ready to solve it?

Start a session on Mockbit #77. You'll receive a grade and specific critique when you submit.

mockbit.io/q/77 · © 2026 Mockbit