Why it can't peek ahead
Mask the upper triangle to −∞ before softmax.
1When the model writes, it predicts one word at a time, so as it decides what comes after “drinks”, the word that will come next (“milk”) hasn’t been written yet. Which words must “drinks” be stopped from reading?your turn
Tap the words “drinks” must not peek at:
1 to find→ continue← backR replay
One head sees one kind of relationship: what about others?
3.5 Multi-head attentionBuilds on3.3How attention blends meaning
Common questions
What is "Why it can't peek ahead" about?
Mask the upper triangle to −∞ before softmax.
What problem does it solve?
The model could peek at future words during training.
What will I be able to do after this lesson?
You can explain causal masking: a word sees only itself and earlier words.
What comes next?
One head sees one kind of relationship: what about others?