LLM-as-a-judge
The idea: Use a strong model as the grader against a rubric, either assigning pointwise scores or making pairwise "which is better" comparisons. It is fast and scalable, but watch its biases (answer order, verbosity, self-preference) and calibrate it against human labels.
What you'll be able to do: Explain LLM-as-a-judge, its biases, and how to make it trustworthy.
The problem it solves: You have 10,000 open-ended answers to grade, and human reviewers can't score them all.
Builds on: Evals: proving it works
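Example: A minimal sketch of two of the safeguards named above, not a definitive implementation. It assumes a hypothetical judge callable that takes a question and two answers and returns "first" or "second"; in practice that callable would wrap a model call. The names debiased_pairwise and cohen_kappa are illustrative. The first function asks the judge twice with the answers swapped and only accepts a verdict that survives the swap, which guards against order bias. The second computes Cohen's kappa, one possible way to check the judge's agreement with human labels beyond chance.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge signature (an assumption, not a real library API):
# (question, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, question: str, a: str, b: str) -> str:
    """Run the judge in both orders; return "A", "B", or "tie"."""
    first_pass = judge(question, a, b)   # answer A shown first
    second_pass = judge(question, b, a)  # answer B shown first
    a_wins_first = first_pass == "first"
    a_wins_second = second_pass == "second"
    if a_wins_first and a_wins_second:
        return "A"
    if not a_wins_first and not a_wins_second:
        return "B"
    # The verdict flipped when the order flipped, so treat it as unreliable.
    return "tie"


def cohen_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Agreement expected if both raters labeled at random with their own label rates.
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)
```

A judge that always picks whichever answer it sees first comes out of debiased_pairwise as "tie" on every pair, which makes the order bias visible instead of letting it decide the winner. For calibration, run the judge on a sample that humans have already labeled and compare the two label lists with cohen_kappa before trusting the judge on the full set.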