LLM model comparison
Frontier and production models from Anthropic, OpenAI, Google, and Meta side by side. Sort by price or context. Filter by provider, tier, modality. Updated quarterly.
| Model | Provider | Best for | Tier | Input $/1M | Output $/1M | Cached $/1M | Context | Latency | Modalities |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | Google | Cheapest multimodal model: vision/audio work at scale | Small | $0.30 | $2.50 | $0.07 | 1000k | 400ms | Text, Vision, Audio |
| GPT-5 Nano | OpenAI | Cheapest viable production model: bulk classification, summarization | Small | $0.50 | $2.00 | $0.05 | 200k | 250ms | Text |
| Claude Haiku 4.5 | Anthropic | Classification, routing, lightweight extraction at scale | Small | $0.80 | $4.00 | $0.08 | 200k | 350ms | Text, Vision |
| Gemini 2.5 Pro | Google | Massive context windows (long docs, video); best price/performance in mid-tier | Mid | $1.25 | $5.00 | $0.30 | 2000k | 1100ms | Text, Vision, Audio |
| GPT-5 Mini | OpenAI | Workhorse for chat-style production at OpenAI cost economics | Mid | $2.00 | $10.00 | $0.20 | 400k | 700ms | Text, Vision |
| Claude Sonnet 4.6 | Anthropic | Default for most production workloads: chat, RAG, coding | Mid | $3.00 | $15.00 | $0.30 | 200k | 900ms | Text, Vision |
| Llama 4 405B | Meta | Open-weight option for self-hosting, fine-tuning, or sovereignty needs | Frontier | $3.00 | $9.00 | $0.60 | 128k | 1500ms | Text |
| GPT-5 | OpenAI | Multimodal inputs, longest context, reasoning-heavy work | Frontier | $12.00 | $60.00 | $1.20 | 400k | 1800ms | Text, Vision, Audio |
| Claude Opus 4.7 | Anthropic | Hardest reasoning, multi-step planning, high-stakes agent loops | Frontier | $15.00 | $75.00 | $1.50 | 200k | 2200ms | Text, Vision |
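The sort and filter controls can be reproduced over the same data in a few lines. A minimal sketch, using three rows from the table above as illustrative records (field names are our own, not a provider API):

```python
# A few rows from the comparison table as plain dicts.
# Prices are USD per 1M tokens; context is in thousands of tokens.
models = [
    {"model": "Gemini 2.5 Flash", "provider": "Google", "tier": "Small",
     "input": 0.30, "output": 2.50, "context_k": 1000},
    {"model": "GPT-5 Mini", "provider": "OpenAI", "tier": "Mid",
     "input": 2.00, "output": 10.00, "context_k": 400},
    {"model": "Claude Opus 4.7", "provider": "Anthropic", "tier": "Frontier",
     "input": 15.00, "output": 75.00, "context_k": 200},
]

# Filter by provider, then sort by input price, cheapest first.
shortlist = [m for m in models if m["provider"] in {"OpenAI", "Google"}]
by_price = sorted(shortlist, key=lambda m: m["input"])
```

Sorting by context instead is the same call with `key=lambda m: m["context_k"]` and `reverse=True`.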
How to read this
Prices are per million tokens in USD, approximate as of early 2026. Cached input is what you pay for tokens served from the provider’s prompt cache.
Context is the maximum input tokens the model accepts. Latency is rough p50 first-token time on a typical 4k-token prompt; actual numbers depend on region, load, streaming, and structured-output mode.
The best-for tag is opinionated, based on what we actually pick in production engagements.
Need help picking?
We pick models for a living. A Discovery Sprint includes a benchmarked recommendation against your actual eval set, not a vendor-blog comparison.
Send a brief →