Evaluation
LLM-as-judge
Using a model to evaluate the outputs of another model.
LLM-as-judge uses a model (usually a frontier model) to score the outputs of your production model against a rubric you provide. It is the standard pattern for evaluating open-ended generation tasks where deterministic scoring, such as exact match or unit tests, is impossible. Calibrate the judge against human ratings on a sample before trusting its scores at scale, and watch for known biases: verbosity bias (judges prefer longer outputs) and position bias (judges prefer the first-listed option in pairwise comparisons).