The thinking dial: when more hurts
More thinking has diminishing then negative returns, so match the budget to the task.
Thinking longer made the model smarter. So crank the dial to max, right?
Dial: less thinking to more thinking
Cost and latency rise the whole way. Does quality?
The catch: more thinking bought more accuracy last lesson, so just max it out, always?
Why this looks like a free win
Last lesson: spending more compute at answer-time, letting the model think longer, bought more accuracy. The obvious move is to turn that dial all the way up, all the time. It's not free, but smarter is worth it, right?
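To see why "always max" can backfire, here is a minimal sketch of the tradeoff. Everything in it is an illustrative assumption rather than measured data: `toy_quality` is a made-up curve that climbs quickly, flattens, then dips once the budget passes a hypothetical 8,000 tokens, and `toy_cost` is a made-up linear price. The only point is that cost keeps rising while quality does not.

```python
import math

# All curves and constants below are hypothetical, chosen only to
# illustrate the shape "diminishing, then negative, returns".

def toy_quality(budget_tokens: int) -> float:
    """Made-up quality curve: fast early gains, a plateau, then a dip."""
    gain = 1.0 - math.exp(-budget_tokens / 2000)           # diminishing returns
    overthinking = 0.00002 * max(0, budget_tokens - 8000)  # assumed penalty past 8000
    return 0.5 + 0.4 * gain - overthinking

def toy_cost(budget_tokens: int, price_per_token: float) -> float:
    """Made-up cost: rises linearly the whole way up the dial."""
    return budget_tokens * price_per_token

def best_budget(budgets, price_per_token: float) -> int:
    """Pick the budget with the highest quality minus cost."""
    return max(budgets, key=lambda b: toy_quality(b) - toy_cost(b, price_per_token))

BUDGETS = [0, 1000, 2000, 4000, 8000, 16000, 32000]

if __name__ == "__main__":
    for b in BUDGETS:
        print(f"{b:>6} tokens  quality={toy_quality(b):.3f}  cost={toy_cost(b, 0.00001):.3f}")
    print("cheap tokens  ->", best_budget(BUDGETS, price_per_token=0.00001))
    print("pricey tokens ->", best_budget(BUDGETS, price_per_token=0.00005))
```

Under these invented numbers the largest budget is never the best pick, and raising the price per token pushes the best budget lower. Real curves differ by model and task, so the shape is the lesson here, not the values.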
Generating tokens is the bottleneck. Can we make it faster?
4.11 Speculative decoding: two models, one fast answer
Common questions
What is "The thinking dial: when more hurts" about?
More thinking has diminishing then negative returns, so match the budget to the task.
What problem does it solve?
If thinking helps, just crank it to max, right?
What will I be able to do after this lesson?
You can reason about thinking budgets: when more reasoning helps, and when it wastes time and money.
What comes next?
Generating tokens is the bottleneck. Can we make it faster?