Plain-language explainer
LLM evals, explained
What are evals, and how do teams know an AI feature actually works?
An eval is a repeatable test for an AI feature: a set of inputs plus a way to score whether the outputs are good enough. Because model outputs are non-deterministic and spot checks ('looks fine') do not scale, teams build evals to catch regressions before users do. Scoring can use exact checks, rubrics, or another model acting as a judge. The hard part is keeping evals honest: a frozen offline set can go stale or leak into training data, so teams also score production traffic and run adversarial tests to catch what the offline set cannot.
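To make "a set of inputs, and a way to score" concrete, here is a minimal sketch of an eval loop using the simplest scorer, an exact check. Everything in it is hypothetical and for illustration only: `EvalCase`, `exact_match`, `run_eval`, and `toy_model` are made-up names, and `toy_model` is a lookup table standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One eval input paired with the answer we expect (hypothetical structure)."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Exact-check scorer: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: Sequence[EvalCase],
    scorer: Callable[[str, str], float],
) -> float:
    """Run every case through the model and return the mean score."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    scores = [scorer(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: a fixed lookup table)."""
    answers = {"Capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "I don't know")


CASES = [
    EvalCase("Capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("Largest planet?", "Jupiter"),
]

# The toy model answers two of the three cases correctly.
pass_rate = run_eval(toy_model, CASES, exact_match)
print(f"pass rate: {pass_rate:.2f}")
```

Swapping `exact_match` for a rubric function or a model judge changes only the scorer; the loop stays the same, which is what makes the test repeatable.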
What people get wrong
- A high offline score means it is safe to ship. The set can be stale or contaminated; watch live traffic too.
- Evals are just unit tests. They score fuzzy quality, often with rubrics or a model judge, not exact equality.
- A model judge is unbiased. Judges have biases and need calibration against human ratings.
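One way to check the last point is to have humans rate a small sample and compare their labels with the judge's. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The labels are invented for illustration, and the function names are hypothetical rather than taken from any particular library.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    observed = agreement_rate(judge, human)
    n = len(human)
    labels = set(judge) | set(human)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (list(judge).count(label) / n) * (list(human).count(label) / n)
        for label in labels
    )
    if expected >= 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Invented labels: True means "this output passes".
human_labels = [True, True, False, False, True, False, True, True]
judge_labels = [True, True, True, False, True, True, True, True]

agreement = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

In this made-up sample the judge passes two outputs that humans failed, so raw agreement looks decent while kappa is much lower: a sign of a lenient judge that needs a tighter rubric or prompt before its scores are trusted.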
Where you see it in real products
- Teams gate releases on an eval suite, like tests in CI.
- Online evals score a sample of real traffic after launch.
- Red-team evals probe for failures and prompt injection before users find them.
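The first two patterns can be sketched in a few lines. This is an illustrative outline under stated assumptions, not a description of how any specific team does it: the thresholds (`min_score`, `max_regression`), the 5% `sample_rate`, and both function names are hypothetical.

```python
import hashlib


def gate_release(
    candidate_score: float,
    baseline_score: float,
    min_score: float = 0.90,        # assumed absolute quality bar
    max_regression: float = 0.02,   # assumed tolerated drop versus the baseline
) -> bool:
    """Return True if the candidate may ship, like a passing test run in CI."""
    meets_bar = candidate_score >= min_score
    no_big_regression = (baseline_score - candidate_score) <= max_regression
    return meets_bar and no_big_regression


def in_online_sample(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically pick a fixed fraction of live requests for scoring.

    Hashing the request ID means the same request always gets the same
    decision, so the sample is reproducible across services and reruns.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


# Gate: a small dip is tolerated, a large regression blocks the release.
print(gate_release(candidate_score=0.93, baseline_score=0.94))
print(gate_release(candidate_score=0.93, baseline_score=0.97))

# Online: count how many of 1,000 hypothetical request IDs fall in the sample.
sampled = sum(in_online_sample(f"req-{i}") for i in range(1000))
print(f"sampled {sampled} of 1000 requests")
```

Checking both an absolute bar and a regression limit guards against two different failures: a candidate that is simply not good enough, and one that clears the bar but is noticeably worse than what is already live.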
Part of See How AI Works, a free interactive course, where you learn how modern AI works by operating it, not watching videos.