Multi-head attention
The idea: Several heads attend to different relationships in parallel.
What you'll be able to do: Explain multi-head attention as several heads running in parallel, each with its own attention pattern.
The problem it solves: A single head produces only one set of attention weights per word, so it can't focus on that word's subject and on its neighbors at the same time.
Builds on: Attention scores, softmax & the weighted sum
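To make the idea concrete, here is a minimal sketch in plain Python, not a definitive implementation. It assumes a simplification: each head looks at its own slice of the token features, where a real model would use learned query, key, and value projections per head and a final output projection. The names `attention_head`, `multi_head_attention`, and the toy input `x` are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """One head: scaled dot-product scores, softmax, weighted sum."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


def multi_head_attention(x, num_heads):
    """Run num_heads heads side by side and concatenate their outputs.

    Assumption: each head reads a slice of the features instead of
    applying learned projections, and no causal mask is applied.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly into heads"
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        part = [row[lo:hi] for row in x]  # this head's view of every token
        out, w = attention_head(part, part, part)
        head_outputs.append(out)
        head_weights.append(w)
    # For each token, join the per-head outputs back into one vector.
    combined = [
        sum((head_outputs[h][t] for h in range(num_heads)), [])
        for t in range(len(x))
    ]
    return combined, head_weights


# Toy input: 3 tokens, 4 features each, split across 2 heads.
x = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 2.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
combined, head_weights = multi_head_attention(x, num_heads=2)
for h, w in enumerate(head_weights):
    print(f"head {h} weights for token 0:", [round(v, 3) for v in w[0]])
```

Each head computes its own softmax over the same tokens, so the two printed weight rows differ: that is the sense in which heads attend to different relationships in parallel. The output keeps the input's shape, one vector of the original width per token.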