Free · scoping calculator
Eval harness scoping.
How big should your eval set be, and how long does it take to build? Realistic estimates from production engagements — labeling time, engineer time, total cost.
8 · Each one needs its own eval coverage
40 · 50 is a typical floor for a real signal; 100+ for production-critical paths
50% · Judge cases are dramatically cheaper but need rubric calibration first
How this is calculated
Time-per-case multipliers are based on what we have measured across actual engagements. The labeling rate is assumed to be $150/hr (a senior analyst or domain expert) and the engineer rate $250/hr.
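To make the arithmetic concrete, here is a minimal sketch of how an estimate like this could be put together. Only the two hourly rates come from the text above. The input names (`capabilities`, `cases_per_capability`, `judge_share`) are guesses at what the three sliders represent, and every per-case time and the calibration overhead are hypothetical placeholders, not the calculator's measured multipliers.

```python
from dataclasses import dataclass

# Hourly rates stated in the text above.
LABELING_RATE_USD_PER_HR = 150.0
ENGINEER_RATE_USD_PER_HR = 250.0

# HYPOTHETICAL placeholders: the real time-per-case multipliers are not
# published on this page. Replace these with your own measurements.
HUMAN_LABEL_MIN_PER_CASE = 15.0   # assumed: expert labels a case by hand
JUDGE_LABEL_MIN_PER_CASE = 3.0    # assumed: expert spot-checks a judge-graded case
ENGINEER_MIN_PER_CASE = 6.0       # assumed: wiring a case into the harness
RUBRIC_CALIBRATION_HR_PER_CAPABILITY = 2.0  # assumed: one-off judge rubric tuning


@dataclass
class Estimate:
    total_cases: int
    labeling_hours: float
    engineer_hours: float
    total_cost_usd: float


def estimate_scope(capabilities: int, cases_per_capability: int,
                   judge_share: float) -> Estimate:
    """Rough eval-set cost from three inputs; judge_share is a 0..1 fraction."""
    if not 0.0 <= judge_share <= 1.0:
        raise ValueError("judge_share must be between 0 and 1")

    total_cases = capabilities * cases_per_capability
    judge_cases = total_cases * judge_share
    human_cases = total_cases - judge_cases

    labeling_hours = (human_cases * HUMAN_LABEL_MIN_PER_CASE
                      + judge_cases * JUDGE_LABEL_MIN_PER_CASE) / 60.0

    # Rubric calibration is only paid when some cases are judge-graded.
    calibration_hours = (RUBRIC_CALIBRATION_HR_PER_CAPABILITY * capabilities
                         if judge_share > 0 else 0.0)
    engineer_hours = total_cases * ENGINEER_MIN_PER_CASE / 60.0 + calibration_hours

    total_cost = (labeling_hours * LABELING_RATE_USD_PER_HR
                  + engineer_hours * ENGINEER_RATE_USD_PER_HR)
    return Estimate(total_cases, labeling_hours, engineer_hours, total_cost)


if __name__ == "__main__":
    # The default values shown on the page: 8, 40, 50%.
    print(estimate_scope(8, 40, 0.5))
```

With these placeholder multipliers, the page defaults (8, 40, 50%) give 320 cases, 48 labeling hours, 48 engineer hours, and $19,200. That figure reflects the assumed constants, not the calculator's own output.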
Real engagements vary. The point of this calculator is not to produce the SOW — it is to anchor the conversation about scope before you commit to “we’ll add evals later” and ship without them.