
LLM model comparison.

Frontier and production models from Anthropic, OpenAI, Google, and Meta side by side. Sort by price or context. Filter by provider, tier, modality. Updated quarterly.


| Model | Provider | Tier | Input $/1M | Output $/1M | Cached $/1M | Context | Latency | Modalities | Best for |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | Google | Small | $0.30 | $2.50 | $0.07 | 1000k | 400ms | Text, Vision, Audio | Cheapest multi-modal model — vision/audio work at scale |
| GPT-5 Nano | OpenAI | Small | $0.50 | $2.00 | $0.05 | 200k | 250ms | Text | Cheapest viable production model — bulk classification, summary |
| Claude Haiku 4.5 | Anthropic | Small | $0.80 | $4.00 | $0.08 | 200k | 350ms | Text, Vision | Classification, routing, lightweight extraction at scale |
| Gemini 2.5 Pro | Google | Mid | $1.25 | $5.00 | $0.30 | 2000k | 1100ms | Text, Vision, Audio | Massive context windows (long docs, video), best price/perf in mid-tier |
| GPT-5 Mini | OpenAI | Mid | $2.00 | $10.00 | $0.20 | 400k | 700ms | Text, Vision | Workhorse for chat-style production at OpenAI cost economics |
| Claude Sonnet 4.6 | Anthropic | Mid | $3.00 | $15.00 | $0.30 | 200k | 900ms | Text, Vision | Default for most production workloads — chat, RAG, coding |
| Llama 4 405B | Meta | Frontier | $3.00 | $9.00 | $0.60 | 128k | 1500ms | Text | Open-weight option for self-hosting, fine-tuning, or sovereignty needs |
| GPT-5 | OpenAI | Frontier | $12.00 | $60.00 | $1.20 | 400k | 1800ms | Text, Vision, Audio | Multi-modal inputs, long context, reasoning-heavy work |
| Claude Opus 4.7 | Anthropic | Frontier | $15.00 | $75.00 | $1.50 | 200k | 2200ms | Text, Vision | Hardest reasoning, multi-step planning, agent loops with high stakes |
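The sort and filter controls on this page reduce to plain operations over the table's rows. A minimal sketch in Python, using a few rows copied from the table (the dict field names are this sketch's own, not an API):

```python
# A few rows copied from the comparison table above; field names are
# invented for this example.
models = [
    {"model": "Gemini 2.5 Flash", "provider": "Google", "tier": "Small",
     "input_per_1m": 0.30, "context_k": 1000},
    {"model": "GPT-5 Nano", "provider": "OpenAI", "tier": "Small",
     "input_per_1m": 0.50, "context_k": 200},
    {"model": "Claude Sonnet 4.6", "provider": "Anthropic", "tier": "Mid",
     "input_per_1m": 3.00, "context_k": 200},
    {"model": "GPT-5", "provider": "OpenAI", "tier": "Frontier",
     "input_per_1m": 12.00, "context_k": 400},
]

# Filter by provider, then sort ascending by input price, as the
# page's controls do.
openai = [m for m in models if m["provider"] == "OpenAI"]
by_price = sorted(openai, key=lambda m: m["input_per_1m"])
print([m["model"] for m in by_price])  # cheapest OpenAI model first
```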
How to read this

Prices are per million tokens in USD, approximate as of early 2026. Cached input is what you pay for tokens served from the provider’s prompt cache.
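Concretely, a request's cost combines the three rates like this. A sketch only: the rates are the Claude Sonnet 4.6 row from the table, while the token counts and the `request_cost` helper are invented for illustration:

```python
def request_cost(input_tokens, output_tokens, cached_tokens,
                 in_rate, out_rate, cached_rate):
    """All rates are USD per 1M tokens. Tokens served from the prompt
    cache are billed at the cached rate instead of the full input rate."""
    uncached = input_tokens - cached_tokens
    return (uncached * in_rate
            + cached_tokens * cached_rate
            + output_tokens * out_rate) / 1_000_000

# A 10k-token prompt with 8k served from cache and a 1k-token reply,
# at Sonnet 4.6 rates ($3.00 in / $15.00 out / $0.30 cached):
cost = request_cost(10_000, 1_000, 8_000, 3.00, 15.00, 0.30)
print(f"${cost:.4f}")  # → $0.0234
```

Note how heavily the cached rate matters: without the cache hit, the same request's input cost alone would roughly quadruple.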

Context is the maximum input tokens the model accepts. Latency is rough p50 first-token time on a typical 4k-token prompt — actual numbers depend on region, load, streaming, and structured-output mode.
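As a sketch of how a p50 figure like the table's is derived: repeated time-to-first-token measurements on the same prompt, summarized by the median. The samples below are invented, not measurements:

```python
import statistics

# Invented time-to-first-token samples (ms) from repeated runs of the
# same 4k-token prompt. The reported p50 is simply the median.
ttft_ms = [420, 385, 445, 390, 610, 402, 398, 415, 430, 388]
p50 = statistics.median(ttft_ms)
print(f"p50 first-token latency: {p50} ms")
```

The median is why one slow outlier (the 610 ms sample) barely moves the headline number, while it would drag a mean noticeably upward.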

The best-for tag is opinionated: it reflects what we actually pick in production engagements.

Need help picking?

We pick models for a living. A Discovery Sprint includes a benchmarked recommendation against your actual eval set, not a vendor-blog comparison.

Send a brief →