Distillation: a small model learns from a big one
The idea: Train a small 'student' model to imitate a big 'teacher' by matching its outputs or output probabilities. The student captures much of the teacher's skill at a fraction of the size and cost, and many of the small, fast models in common use are produced this way.
What you'll be able to do: Explain distillation as training a small student model to mimic a big teacher so that inference is cheap.
The problem it solves: Frontier models are huge and pricey to serve. Must every task pay the full cost?
Builds on: Quantization
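A sketch of the idea in code: one common way to make a student "imitate the teacher's probabilities" is to soften both models' outputs with a temperature and penalize the KL divergence between them. The function names, the example logits, the temperature of 2.0, and the temperature-squared scaling are illustrative assumptions here, not details taken from this lesson.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened probabilities.

    Multiplying by temperature**2 is a common convention (assumed here)
    that keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over three classes.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different class

loss_close = distillation_loss(close_student, teacher_logits)
loss_far = distillation_loss(far_student, teacher_logits)
print(loss_close < loss_far)
```

In a real training loop this loss would be minimized over many examples by updating the student's weights, often mixed with an ordinary loss on ground-truth labels. The sketch only shows that the loss shrinks as the student's probabilities approach the teacher's.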