Plain-language explainer
Attention, explained
What does the attention mechanism do in a transformer?
Attention lets each word look at the other words in the sentence and decide which ones matter for it right now. In 'the trophy did not fit in the suitcase because it was too big', attention is what tells the model that 'it' refers to the trophy. Each position gathers a weighted blend of the others, leaning hardest on the ones that fit. This is how a model handles pronouns, long-range references, and the way a word's meaning shifts with its context.
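To make the "weighted blend" concrete, here is a minimal sketch of one attention step in plain Python. The vectors are hand-picked toy numbers, not values from a trained model, and the names `softmax`, `attend`, and `query_it` are illustrative choices made for this example. It assumes the common scaled dot-product form: score each word against the query, turn the scores into weights that sum to 1, then mix the value vectors by those weights.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Blend `values` by how well each key matches `query` (scaled dot product)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Toy 2-d vectors, chosen by hand for illustration (assumption: a real model
# learns these, and they have hundreds or thousands of dimensions).
words = ["trophy", "suitcase", "it"]
keys = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_it = [2.0, 0.0]  # the query for "it", picked so it lines up with "trophy"

weights, blended = attend(query_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.2f}")
```

With these toy numbers the largest weight lands on "trophy", so the blended vector for "it" is pulled mostly toward the trophy's value vector. A trained model arrives at that kind of match through learned projections rather than hand-set numbers.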
Do not just read it. Operate the mechanism yourself in a short interactive lesson.
See it work: How attention blends meaning. Free, no code, no signup.
What people get wrong
- "Attention reads strictly left to right." Not quite: each position can weigh every earlier word at once, not just the previous one.
- "It is keyword matching." No: it is a learned, weighted blend of meaning, not exact-word lookup.
- "More attention heads is always better." Not necessarily: heads specialize, and past a point you get diminishing returns.
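The first point can be sketched in a few lines. This assumes the causal setup used by text-generating models, where a position may look at itself and everything before it but nothing after. The function name `causal_weights` and the score table are made up for illustration.

```python
import math

def causal_weights(scores):
    """Row i gets softmax weights over positions 0..i; later positions get 0."""
    rows = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # this position and everything earlier
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)  # later words are masked out
        rows.append([e / total for e in exps] + hidden)
    return rows

# Hypothetical raw match scores: scores[i][j] is how well word j fits word i.
scores = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [2.0, 0.5, 0.0],
]
weights = causal_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The last row spreads its weight over all three positions at once and leans hardest on the first one, which is the opposite of reading only the previous word.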
Where you see it in real products
- Nearly every modern chat and coding model is built on stacked attention layers.
- Long-document understanding depends on attention linking distant parts.
- Quality on pronouns, code references, and citations depends heavily on attention working well.
Part of See How AI Works, a free interactive course, where you learn how modern AI works by operating it, not watching videos.