Attention scores, softmax & the weighted sum
The idea: Q·K → scores → softmax → weighted sum of Values.
What you'll be able to do: You can explain attention end to end: Q·K → softmax → weighted sum of Values.
The problem it solves: How do raw scores become a blend of meanings?
Builds on: Query, Key, Value
← Query, Key, Value · Next: Causal masking →
All lessons