What are evals, and how do teams know an AI feature actually works?

Question

Accepted Answer

An eval is a repeatable test for an AI feature: a set of inputs plus a way to score whether the outputs are good enough. Because model outputs are non-deterministic and checking by eye ('looks fine') does not scale, teams build evals to catch regressions before users do. Scoring can use exact checks, rubrics, or another model acting as a judge. The hard part is keeping evals honest: a frozen offline set can go stale or leak into training data, so tests in production and adversarial tests are needed to catch what the offline set cannot.
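
To make the "inputs plus a scorer" idea concrete, here is a minimal sketch in Python. It is an illustration, not a reference implementation. Every name in it (`Case`, `exact_match`, `run_eval`, `fake_model`) and the 0.8 pass threshold are assumptions made for this example. `fake_model` is a stand-in for a real model call, and only the exact-check style of scoring is shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One eval case: an input prompt and the answer we expect."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Exact check: 1.0 if the strings match after trimming and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: List[Case],
    scorer: Callable[[str, str], float] = exact_match,
    threshold: float = 0.8,  # assumed pass bar, chosen only for illustration
) -> dict:
    """Run every case through the model, score it, and compare the mean to the bar."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    scores = [scorer(model(case.prompt), case.expected) for case in cases]
    mean_score = sum(scores) / len(scores)
    return {"scores": scores, "mean_score": mean_score, "passed": mean_score >= threshold}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it knows two answers."""
    answers = {"Capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "I don't know")


CASES = [
    Case("Capital of France?", "paris"),
    Case("2 + 2?", "4"),
    Case("Capital of Peru?", "Lima"),
]

report = run_eval(fake_model, CASES)
print(report["scores"], report["passed"])
```

Because `scorer` is just a function from (output, expected) to a number, a rubric or a model-as-judge call could be passed in the same way. Re-running `run_eval` on the same cases after each change is what turns it into a regression check.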

LLM evals, explained

What people get wrong

Where you see it in real products
