Plain-language explainer

LLM evals, explained

What are evals, and how do teams know an AI feature actually works?

An eval is a repeatable test for an AI feature: a set of inputs, and a way to score whether the outputs are good enough. Because model outputs vary from run to run and eyeballing a few responses ('looks fine') does not scale, teams build evals to catch regressions before users do. Scoring can use exact checks, rubrics, or another model acting as a judge. The hard part is keeping evals honest: a frozen offline set can go stale or leak into training data, so teams also score production traffic and run adversarial tests to catch what the offline set misses.
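
As a rough illustration of "a set of inputs, and a way to score the outputs", here is a minimal Python sketch. Every name in it (EvalCase, exact_match, run_eval, toy_model) is hypothetical and invented for this example. The toy model is a lookup table standing in for a real model call, and exact match is only the simplest of the scoring options mentioned above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str    # input sent to the feature under test
    expected: str  # reference answer used by the scorer


def exact_match(output: str, expected: str) -> float:
    """Exact check: 1.0 if the normalized output equals the reference, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run every case through the model and return the mean score."""
    if not cases:
        raise ValueError("eval set is empty")
    scores = [scorer(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption for this sketch)."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


CASES = [
    EvalCase("2 + 2 =", "4"),
    EvalCase("Capital of France?", "paris"),
    EvalCase("Largest planet?", "Jupiter"),
    EvalCase("Opposite of hot?", "cold"),
]

pass_rate = run_eval(toy_model, CASES)
print(f"pass rate: {pass_rate:.2f}")  # prints "pass rate: 0.50"
```

Because the scorer is passed in as a function, the same loop could in principle take a rubric-based scorer or a model judge instead of exact_match. Rerunning it on a fixed case list after each change is what makes a regression visible.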

Do not just read it. Operate the mechanism yourself in a short interactive lesson.

See it work: Evals: proving it works →

Free, no code, no signup.

What people get wrong

  • "A high offline score means it is safe to ship." Not necessarily: the set can be stale or contaminated, so watch live traffic too.
  • "Evals are just unit tests." Unit tests assert exact equality; evals usually score fuzzy quality, often with rubrics or a model judge.
  • "A model judge is unbiased." Judges have biases of their own (for example, favoring longer answers) and need calibration against human ratings.
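
One way to picture "calibration against human ratings" is to have people label a small sample and then measure how often the judge agrees with them. The sketch below is an assumption about how a team might do this, not a prescribed method. The labels are made up, and the function names are hypothetical. It reports raw agreement and Cohen's kappa, a standard statistic that corrects agreement for chance.

```python
from collections import Counter


def agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of items where the judge label equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = agreement(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    # Expected agreement if both raters labeled independently at their own rates.
    p_e = sum((judge_counts[l] / n) * (human_counts[l] / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for eight responses (illustrative data only).
HUMAN = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
JUDGE = ["pass", "pass", "pass", "fail", "pass", "pass", "pass", "fail"]

print(f"agreement: {agreement(JUDGE, HUMAN):.2f}")  # prints "agreement: 0.75"
print(f"kappa: {cohens_kappa(JUDGE, HUMAN):.2f}")   # prints "kappa: 0.50"
```

In this toy data the judge never fails a response the humans passed, but it passes two that the humans failed. That is a lenient bias, and a single accuracy number would hide it.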

Where you see it in real products

  • Teams gate releases on an eval suite, like tests in CI.
  • Online evals score a sample of real traffic after launch.
  • Red-team evals probe for failures and prompt injection before users find them.
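
The first two patterns above can be sketched in a few lines of Python. This is a hedged illustration, not a description of any particular team's setup. The function names, the 0.90 floor, the 0.02 allowed drop, and the request ID format are all assumptions chosen for the example.

```python
import hashlib


def gate(
    candidate_score: float,
    baseline_score: float,
    min_score: float = 0.90,  # assumed absolute floor
    max_drop: float = 0.02,   # assumed tolerated regression vs. baseline
) -> bool:
    """Release gate: pass only if the candidate clears the floor
    and has not dropped more than max_drop below the baseline."""
    clears_floor = candidate_score >= min_score
    no_regression = (baseline_score - candidate_score) <= max_drop
    return clears_floor and no_regression


def in_sample(request_id: str, rate: float) -> bool:
    """Online sampling: deterministically pick roughly `rate` of requests
    by hashing the request ID, so the same request always gets the same answer."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


# Gate examples (scores are invented).
print(gate(0.93, 0.94))  # prints "True": above the floor, small drop
print(gate(0.91, 0.95))  # prints "False": drop of 0.04 exceeds max_drop

# Select about 10% of hypothetical request IDs for online scoring.
sampled = [f"req-{i}" for i in range(1000) if in_sample(f"req-{i}", 0.10)]
print(f"{len(sampled)} of 1000 requests selected for scoring")
```

In a CI job, a False result from gate would typically be turned into a non-zero exit code so the release step does not run. Hashing the ID instead of calling a random number generator keeps the sample reproducible when a request is replayed.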

Part of See How AI Works, a free interactive course, where you learn how modern AI works by operating it, not watching videos.