← All free toolsFree · live calculator

How much will your AI cost?

Real 2026 per-million-token prices for Claude, GPT-5, and Gemini. Real production assumptions. A monthly estimate in 30 seconds — plus opinionated recommendations on where you’re probably overpaying.

Start from a preset

Your workload

Requests per day5,000

Inference calls to the LLM, averaged over a typical day

type a precise value

Input tokens per request1,500 tok

System prompt + user input + retrieved context (1 token ≈ 0.75 words)

type a precise value

Output tokens per request350 tok

What the model actually generates back

type a precise value

Model tier

FrontierClaude Opus 4.7 · GPT-5 · Gemini 2.5 UltraMid-tierClaude Sonnet 4.6 · GPT-5 Mini · Gemini 2.5 ProSmall / fastClaude Haiku 4.5 · GPT-5 Nano · Gemini 2.5 Flash

Prompt caching

No prompt cachingYou pay full price for every input token, every request.Some caching (system prompt)System prompt cached; 50% of input is at the cached rate.Aggressive cachingSystem prompt + retrieved docs cached. 85% of input at the cached rate.

Retrieval

No retrievalPure prompting; no knowledge-base lookups.Vector RAGEach request embeds the query; retrieval cost is mostly embedding tokens.Hybrid retrievalBM25 + dense + re-ranking. Slightly more embedding work per query.

Estimated cost

$973 / mo

$11,675 / year · $0.006 per request

Input / mo

228.0M

Output / mo

53.2M

Embedded / mo

30.4M

Breakdown

Input tokens (uncached)$285
Input tokens (cached)$45.60
Output tokens$638
Embedding for retrieval$3.95

Optimized estimate

$295 / mo

↓ $678 / mo with the recommendations below

Send this to hello@mini-trends.com →

Pre-fills the email with your inputs and result

Where you’re probably overpaying

Heuristics from running this exact optimization on dozens of production systems. Most clients realize 40–80% of these savings within a single sprint.

Route easy calls to a small model

save ~$365/mo

Classification, routing, intent detection, and lightweight extraction can almost always run on Haiku / Nano / Flash without quality loss. Build a router and split traffic.

Cache more aggressively

save ~$168/mo

You are caching the system prompt. The next-largest stable region — retrieved-doc context — can usually be cached too. Look at your prompt structure for cache-friendly ordering.

Use batch APIs for non-realtime work

save ~$146/mo

For evals, summarization jobs, and async pipelines, batch APIs from Anthropic / OpenAI / Google price 50% off. If even 30% of your traffic is non-interactive, this stacks with the other savings.

How this calculator works (and what it doesn’t cover)

Token costs are computed against per-million-token prices roughly representative of frontier, mid-tier, and small models from Anthropic, OpenAI, and Google as of early 2026. Cached input is priced at the provider’s reduced cached-read rate. Embeddings use a typical $0.13/1M-token rate.

Not included: vector-database hosting, GPU/inference if self-hosting, observability infra, evaluation runs, agentic-loop multipliers (an agent that iterates 3× pays 3× the per-request cost shown here), or human review costs. In real engagements we model these explicitly — this calculator is a first-order estimate to anchor the conversation, not the proposal itself.

Want this done on your real system?

We do AI cost optimization as a Discovery Sprint.

Two weeks, fixed price ($35k). We instrument your production system, identify the real savings, ship the changes, and hand you a written report. Most engagements pay back the engagement fee in under three months.

Send a brief based on this estimate See pricing