Causal masking
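The idea: Set the entries above the diagonal of the attention-score matrix (the upper triangle) to −∞ before softmax, so those positions get exactly zero weight.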
What you'll be able to do: Explain causal masking: each word attends only to itself and earlier words.
The problem it solves: Without a mask, every position can attend to the words after it, so during training the model could peek at the future words it is supposed to predict.
Builds on: Attention scores, softmax & the weighted sum
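
As a rough illustration of the idea, here is a minimal sketch in plain Python. It assumes the scores arrive as a square list of lists, with row i holding the scores for word i against every word in the sequence. The function name causal_attention_weights is made up for this example and does not come from any library.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions to -inf, then softmax each row (illustrative sketch)."""
    weights = []
    for i, row in enumerate(scores):
        # Keep column j only if j <= i (the word itself or an earlier word);
        # every later column is a future word and is set to -inf.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        # Subtract the row maximum for numerical stability. The diagonal entry
        # is never masked, so the maximum is always finite.
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With all scores equal, each word spreads its attention evenly over
# itself and the words before it, and gives 0 to every later word.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

Masking before softmax, rather than zeroing weights afterwards, means each row still sums to 1 without a separate renormalization step.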