Route easy calls to a small model
save ~$365/moClassification, routing, intent detection, and lightweight extraction can almost always run on Haiku / Nano / Flash without quality loss. Build a router and split traffic.
Real 2026 per-million-token prices for Claude, GPT-5, and Gemini. Real production assumptions. A monthly estimate in 30 seconds — plus opinionated recommendations on where you’re probably overpaying.
Start from a preset
Inference calls to the LLM, averaged over a typical day
System prompt + user input + retrieved context (1 token ≈ 0.75 words)
What the model actually generates back
Estimated cost
$973 / mo
$11,675 / year · $0.006 per request
Input / mo
228.0M
Output / mo
53.2M
Embedded / mo
30.4M
Breakdown
Optimized estimate
$295 / mo
↓ $678 / mo with the recommendations below
Pre-fills the email with your inputs and result
Heuristics from running this exact optimization on dozens of production systems. Most clients realize 40–80% of these savings within a single sprint.
Classification, routing, intent detection, and lightweight extraction can almost always run on Haiku / Nano / Flash without quality loss. Build a router and split traffic.
You are caching the system prompt. The next-largest stable region — retrieved-doc context — can usually be cached too. Look at your prompt structure for cache-friendly ordering.
For evals, summarization jobs, and async pipelines, batch APIs from Anthropic / OpenAI / Google price 50% off. If even 30% of your traffic is non-interactive, this stacks with the other savings.
Token costs are computed against per-million-token prices roughly representative of frontier, mid-tier, and small models from Anthropic, OpenAI, and Google as of early 2026. Cached input is priced at the provider’s reduced cached-read rate. Embeddings use a typical $0.13/1M-token rate.
Not included: vector-database hosting, GPU/inference if self-hosting, observability infra, evaluation runs, agentic-loop multipliers (an agent that iterates 3× pays 3× the per-request cost shown here), or human review costs. In real engagements we model these explicitly — this calculator is a first-order estimate to anchor the conversation, not the proposal itself.
Want this done on your real system?
Two weeks, fixed price ($35k). We instrument your production system, identify the real savings, ship the changes, and hand you a written report. Most engagements pay back the engagement fee in under three months.