Mixture of Experts
The idea: A router sends each token to a few expert FFNs; sparsity decouples cost.
What you'll be able to do: You can explain Mixture of Experts: route to a few experts, all knowledge, low cost.
The problem it solves: How to add parameters without paying for them every token?
Builds on: Matrix × vector as a neural layer
← Scaling laws · Next: Quantization →
All lessons