
LLM model comparison.

Frontier and production models from Anthropic, OpenAI, Google, and Meta side by side. Sort by price or context. Filter by provider, tier, modality. Updated quarterly.


| Model | Provider | Tier | Input $/1M | Output $/1M | Cached $/1M | Context | Latency | Modalities | Best for |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | Google | Small | $0.30 | $2.50 | $0.07 | 1000k | 400ms | Text, Vision, Audio | Cheapest multi-modal model — vision/audio work at scale |
| GPT-5 Nano | OpenAI | Small | $0.50 | $2.00 | $0.05 | 200k | 250ms | Text | Cheapest viable production model — bulk classification, summary |
| Claude Haiku 4.5 | Anthropic | Small | $0.80 | $4.00 | $0.08 | 200k | 350ms | Text, Vision | Classification, routing, lightweight extraction at scale |
| Gemini 2.5 Pro | Google | Mid | $1.25 | $5.00 | $0.30 | 2000k | 1100ms | Text, Vision, Audio | Massive context windows (long docs, video), best price/perf in mid-tier |
| GPT-5 Mini | OpenAI | Mid | $2.00 | $10.00 | $0.20 | 400k | 700ms | Text, Vision | Workhorse for chat-style production at OpenAI cost economics |
| Claude Sonnet 4.6 | Anthropic | Mid | $3.00 | $15.00 | $0.30 | 200k | 900ms | Text, Vision | Default for most production workloads — chat, RAG, coding |
| Llama 4 405B | Meta | Frontier | $3.00 | $9.00 | $0.60 | 128k | 1500ms | Text | Open-weight option for self-hosting, fine-tuning, or sovereignty needs |
| GPT-5 | OpenAI | Frontier | $12.00 | $60.00 | $1.20 | 400k | 1800ms | Text, Vision, Audio | Multi-modal inputs, long context, reasoning-heavy work |
| Claude Opus 4.7 | Anthropic | Frontier | $15.00 | $75.00 | $1.50 | 200k | 2200ms | Text, Vision | Hardest reasoning, multi-step planning, agent loops with high stakes |
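The sort and filter controls on this page reduce to plain operations over the table's rows. A minimal sketch in Python, using a few rows copied from the table (the dict field names are this sketch's own, not an API):

```python
# A few rows copied from the comparison table above; field names are
# invented for this example.
models = [
    {"model": "Gemini 2.5 Flash", "provider": "Google", "tier": "Small",
     "input_per_1m": 0.30, "context_k": 1000},
    {"model": "GPT-5 Nano", "provider": "OpenAI", "tier": "Small",
     "input_per_1m": 0.50, "context_k": 200},
    {"model": "Claude Sonnet 4.6", "provider": "Anthropic", "tier": "Mid",
     "input_per_1m": 3.00, "context_k": 200},
    {"model": "GPT-5", "provider": "OpenAI", "tier": "Frontier",
     "input_per_1m": 12.00, "context_k": 400},
]

# Filter by provider, then sort ascending by input price, as the
# page's controls do.
openai = [m for m in models if m["provider"] == "OpenAI"]
by_price = sorted(openai, key=lambda m: m["input_per_1m"])
print([m["model"] for m in by_price])  # cheapest OpenAI model first
```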
How to read this

Prices are per million tokens in USD, approximate as of early 2026. Cached input is what you pay for tokens served from the provider’s prompt cache.
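Concretely, a request's cost combines the three rates like this. A sketch only: the rates are the Claude Sonnet 4.6 row from the table, while the token counts and the `request_cost` helper are invented for illustration:

```python
def request_cost(input_tokens, output_tokens, cached_tokens,
                 in_rate, out_rate, cached_rate):
    """All rates are USD per 1M tokens. Tokens served from the prompt
    cache are billed at the cached rate instead of the full input rate."""
    uncached = input_tokens - cached_tokens
    return (uncached * in_rate
            + cached_tokens * cached_rate
            + output_tokens * out_rate) / 1_000_000

# A 10k-token prompt with 8k served from cache and a 1k-token reply,
# at Sonnet 4.6 rates ($3.00 in / $15.00 out / $0.30 cached):
cost = request_cost(10_000, 1_000, 8_000, 3.00, 15.00, 0.30)
print(f"${cost:.4f}")  # → $0.0234
```

Note how heavily the cached rate matters: without the cache hit, the same request's input cost alone would roughly quadruple.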

Context is the maximum input tokens the model accepts. Latency is rough p50 first-token time on a typical 4k-token prompt — actual numbers depend on region, load, streaming, and structured-output mode.
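As a sketch of how a p50 figure like the table's is derived: repeated time-to-first-token measurements on the same prompt, summarized by the median. The samples below are invented, not measurements:

```python
import statistics

# Invented time-to-first-token samples (ms) from repeated runs of the
# same 4k-token prompt. The reported p50 is simply the median.
ttft_ms = [420, 385, 445, 390, 610, 402, 398, 415, 430, 388]
p50 = statistics.median(ttft_ms)
print(f"p50 first-token latency: {p50} ms")
```

The median is why one slow outlier (the 610 ms sample) barely moves the headline number, while it would drag a mean noticeably upward.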

The best-for tag is opinionated: it reflects what we actually pick in production engagements.

Need help picking?

We pick models for a living. A Discovery Sprint includes a benchmarked recommendation against your actual eval set, not a vendor-blog comparison.

Send a brief →