7.10E · Exhibit ●●○○○
Reading benchmarks safely
A single score hides traps: contamination (test answers leaked into training), saturation (everyone clustered near 100%), jagged skills (great at one task, poor at a neighbor), and weak domain transfer. The leaderboard is not your task.
You'll get more from this if you've seen 7.3 Evals: proving it works
1. A leaderboard says 92% vs 89%. Before you trust it, make the call.
The wall
Model A scores 92% on a public leaderboard. Model B scores 89%. The numbers are right there, A is higher — so A is the better choice for your project, right?
Leaderboard · overall score
Model A: 92%
Model B: 89%
Is the higher-scoring model the better choice for your task?
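One way to see why the headline number may not settle the question is to split it by category. The sketch below is illustrative only: the category names, item counts, and per-category results are invented so that the totals match the 92% and 89% on the leaderboard, and `RESULTS`, `overall_accuracy`, `category_accuracy`, and `pick_model` are hypothetical names, not part of any real leaderboard or library.

```python
from typing import Dict, Optional, Tuple

# category -> (number of items, items Model A got right, items Model B got right)
# All numbers are invented for illustration; they are chosen so that the
# overall scores come out to 92% (A) and 89% (B).
RESULTS: Dict[str, Tuple[int, int, int]] = {
    "trivia": (400, 392, 360),
    "math": (300, 285, 264),
    "code": (200, 183, 176),
    "legal_summaries": (100, 60, 90),
}

MODEL_A = 1  # tuple index of Model A's correct count
MODEL_B = 2  # tuple index of Model B's correct count


def overall_accuracy(results: Dict[str, Tuple[int, int, int]], model: int) -> float:
    """Headline score: total correct answers divided by total items."""
    total = sum(row[0] for row in results.values())
    correct = sum(row[model] for row in results.values())
    return correct / total


def category_accuracy(
    results: Dict[str, Tuple[int, int, int]], category: str, model: int
) -> float:
    """Score on a single category, which is what a specific task depends on."""
    row = results[category]
    return row[model] / row[0]


def pick_model(
    results: Dict[str, Tuple[int, int, int]], category: Optional[str] = None
) -> str:
    """Pick by headline score when category is None, otherwise by that category."""
    if category is None:
        a = overall_accuracy(results, MODEL_A)
        b = overall_accuracy(results, MODEL_B)
    else:
        a = category_accuracy(results, category, MODEL_A)
        b = category_accuracy(results, category, MODEL_B)
    return "A" if a >= b else "B"


if __name__ == "__main__":
    print("Overall:", overall_accuracy(RESULTS, MODEL_A), overall_accuracy(RESULTS, MODEL_B))
    for name in RESULTS:
        print(
            name,
            category_accuracy(RESULTS, name, MODEL_A),
            category_accuracy(RESULTS, name, MODEL_B),
        )
    print("Pick by leaderboard:", pick_model(RESULTS))
    print("Pick for legal summaries:", pick_model(RESULTS, "legal_summaries"))
```

With these made-up numbers, Model A wins on the overall score, yet Model B is clearly ahead on the one category that resembles a legal-summary task. That is the jagged-skills and domain-transfer trap in miniature. In practice the breakdown that matters is a small evaluation set drawn from your own task, not the leaderboard's categories.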
Common questions
What problem does it solve?
Model A scores 92% on the leaderboard, Model B 89%. So A is the better choice for you — right?
What will I be able to do after this lesson?
You will be able to read public benchmark scores with four traps in mind (contamination, saturation, jagged capability, and domain transfer) and explain why one number can't answer “is this good for my task?”.
What comes next?
Benchmarks probe what models can do — but what can they never do?
7.11 The edge of the map: what LLMs can't do