Question 1

What is "Causal masking" about?

Accepted Answer

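Causal masking sets the entries above the diagonal of the attention score matrix (the upper triangle) to −∞ before the softmax, so each of those positions gets an attention weight of exactly zero.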
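A minimal NumPy sketch of that step may help make it concrete. The function name `causal_softmax` and the all-zero toy scores are assumptions chosen for illustration, not something from the lesson itself.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with the upper triangle masked."""
    n = scores.shape[0]
    # True strictly above the diagonal: column j > row i is "the future" for row i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Subtract the row max for numerical stability; exp(-inf) evaluates to 0.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))  # toy scores: every pair of positions equally relevant
weights = causal_softmax(scores)
print(weights)
```

With equal scores, row 0 puts all its weight on position 0, row 1 splits it evenly over positions 0 and 1, and row 2 splits it over all three. Every row still sums to 1.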

Question 2

What problem does it solve?

Accepted Answer

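Without the mask, the model could peek at future words during training, when the whole sequence is fed in at once. Predicting the next word would then amount to copying it from the input.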
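The leak can be observed directly in a small experiment. The sketch below is a simplification under stated assumptions: a single attention head with no learned weights (queries, keys, and values are all the raw input `x`), random toy vectors, and a hypothetical helper named `attention`. It replaces only the last token and checks whether the outputs at earlier positions change.

```python
import numpy as np

def attention(x, causal):
    """Single-head self-attention with queries = keys = values = x (no learned weights)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    if causal:
        n = x.shape[0]
        # Hide every position j > i from row i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 toy "words", 8 features each
x_changed = x.copy()
x_changed[3] = rng.normal(size=8)  # replace only the last ("future") word

# Compare the outputs at the first three positions before and after the change.
leaks_without_mask = not np.allclose(attention(x, False)[:3], attention(x_changed, False)[:3])
stable_with_mask = np.allclose(attention(x, True)[:3], attention(x_changed, True)[:3])
print(leaks_without_mask, stable_with_mask)
```

Without the mask, changing the last word alters the outputs at earlier positions, which means they depended on a word they should not have seen. With the mask, those outputs stay the same.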

Question 3

What will I be able to do after this lesson?

Accepted Answer

You will be able to explain causal masking: each word attends only to itself and earlier words.

Question 4

What comes next?

Accepted Answer

A single attention head captures only one kind of relationship between words. The next lesson asks how to capture the others.