Speculative decoding: two models, one fast answer
The idea: A small, fast model drafts several tokens ahead; the big model checks them all in one pass and keeps the longest run of drafted tokens it agrees with. Same output, fewer slow steps.
What you'll be able to do: Explain how speculative decoding speeds up generation: a small model drafts tokens, and the big model verifies them in a single pass.
The problem it solves: The model writes one token at a time, in order. That's slow — can we cheat the wait?
Builds on: Decoding & sampling knobs
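
The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for the big and small models, both decode greedily (real systems that sample use a probabilistic accept/reject rule instead of an exact match), and the verification step is written as a loop where a real model would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function that maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Hypothetical big model: a deterministic toy rule standing in for a slow LLM."""
    return (sum(prefix) * 7 + 3) % 10


def draft_next(prefix: List[int]) -> int:
    """Hypothetical small model: imitates the target but is wrong at every 4th position."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 != 0 else (guess + 1) % 10


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens, one after another (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks every drafted position. Counted as ONE pass because
        #    a real model would score all k + 1 prefixes in a single forward pass.
        target_passes += 1
        checks = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # 3. Keep the longest run of drafted tokens the target agrees with.
        agreed = 0
        while agreed < k and proposal[agreed] == checks[agreed]:
            agreed += 1
        tokens.extend(proposal[:agreed])

        # 4. Append the target's own token at the first disagreement (or a bonus
        #    token if everything matched), so each pass makes progress.
        tokens.append(checks[agreed])
    return tokens[:goal], target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, max_new=12)
fast, passes = speculative_decode(target_next, draft_next, prompt, max_new=12, k=4)
print("same output:", fast == baseline)
print("target passes:", passes, "vs", 12, "for plain decoding")
```

Every token kept is one the target itself would have produced, which is why the output matches plain greedy decoding. The saving comes from the target running once per accepted run instead of once per token, so the speedup depends on how often the draft model guesses right.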