The transformer block, assembled & stacked
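The idea: self-attention, a feed-forward network (FFN), and residual connections with LayerNorm together make one block; stacking N such blocks gives a GPT-style model (as in nano-GPT).
What you'll be able to do: Describe a complete transformer block and explain how a stack of blocks predicts the next word.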
The problem it solves: How do the parts fit into the whole machine?
Builds on: Attention scores, softmax & the weighted sum, Matrix × vector as a neural layer, Residuals & LayerNorm
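As a rough sketch of how these parts might fit together, the pure-Python example below wires attention, an FFN, residual additions, and LayerNorm into one block, then stacks a few blocks to produce next-token probabilities. It is not nano-GPT's code. It assumes a single attention head, ReLU in the FFN, LayerNorm without learned scale and shift, no biases, no positional embeddings, random untrained weights, and an output projection that reuses the embedding table. All names (block, init_block, next_token_probs, wq, wk, wv, wo, w1, w2) are made up for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(x, p):
    """Single-head self-attention; position t only attends to positions 0..t."""
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    scale = math.sqrt(len(q[0]))
    weights = []
    for t, q_t in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_t, k_s)) / scale for k_s in k[: t + 1]]
        # Future positions get weight 0, which is the causal mask.
        weights.append(softmax(scores) + [0.0] * (len(x) - t - 1))
    return matmul(matmul(weights, v), p["wo"])

def ffn(x, p):
    """Position-wise feed-forward network: expand, ReLU, project back."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, p["w1"])]
    return matmul(hidden, p["w2"])

def block(x, p):
    """One pre-norm transformer block: two sublayers, each wrapped in a residual."""
    x = add(x, causal_attention(layer_norm(x), p))
    x = add(x, ffn(layer_norm(x), p))
    return x

def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_block(rng, d):
    """Random (untrained) weights for one block with model width d."""
    shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
              "w1": (d, 4 * d), "w2": (4 * d, d)}
    return {name: rand_matrix(rng, r, c) for name, (r, c) in shapes.items()}

def next_token_probs(token_ids, embed, blocks):
    """Embed tokens, run the stack of blocks, and score every vocabulary entry."""
    x = [list(embed[i]) for i in token_ids]
    for p in blocks:  # stacking: the output of one block feeds the next
        x = block(x, p)
    x = layer_norm(x)
    # Assumption: reuse the embedding table as the output projection.
    logits = [sum(a * b for a, b in zip(x[-1], e)) for e in embed]
    return softmax(logits)

rng = random.Random(0)
vocab, d, n_layers = 10, 8, 3
embed = rand_matrix(rng, vocab, d)
blocks = [init_block(rng, d) for _ in range(n_layers)]
probs = next_token_probs([1, 4, 2], embed, blocks)
print(len(probs), round(sum(probs), 6))
```

Only the last position's vector is used to score the next token, and each block keeps the input shape (sequence length by d), which is what makes stacking N of them straightforward. With untrained weights the probabilities carry no meaning; the point is the data flow.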