Evaluation

LLM-as-judge

Using a model to evaluate the outputs of another model.

LLM-as-judge uses a (usually frontier) model to score the outputs of your production model against a rubric you provide. It is the standard pattern for evaluating open-ended generation tasks where deterministic, rule-based scoring is impossible. Calibrate the judge against human ratings on a sample before trusting its scores at scale, and watch for known biases: judges tend to favor longer outputs (verbosity bias) and the first-listed option in pairwise comparisons (position bias).
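
A minimal sketch of the pattern, assuming the OpenAI Python SDK as the judge backend; the model name, 1–5 scale, and rubric wording are illustrative, not prescriptive:

```python
# Minimal LLM-as-judge scorer (sketch). Assumes the OpenAI Python SDK;
# the judge model, rubric, and score scale are placeholder choices.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Score the answer from 1 (unusable) to 5 (excellent) on:
- factual accuracy
- completeness
- clarity
Return JSON: {"score": <int>, "rationale": "<one sentence>"}"""

def judge(question: str, answer: str, judge_model: str = "gpt-4o") -> dict:
    """Ask a stronger model to grade one production output against the rubric."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # keep grading as repeatable as possible
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Score a single production output.
print(judge("What does our refund policy cover?", "Refunds are available within 30 days."))
```

In practice you would run this over a held-out sample, compare the judge's scores to human ratings to calibrate, and only then apply it across your full evaluation set.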
