Multi-modal

Models that accept inputs beyond text: images, audio, video, and PDFs.

Modern frontier models accept images, audio, and, increasingly, video as inputs alongside text. In production this enables document understanding without a separate OCR step, visual QA, voice agents, and image-grounded chat. Each modality has its own pricing and latency profile, so estimate cost and response time per modality rather than assuming text-only rates.
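As a rough illustration of per-modality budgeting, the sketch below totals the estimated cost of one request from its text, image, audio, and video usage. The unit prices, the `RequestUsage` type, and the `estimate_cost` helper are all hypothetical placeholders, not any provider's actual rates or API; substitute the figures from your provider's pricing page.

```python
from dataclasses import dataclass

# Hypothetical unit prices in USD. Real prices vary by provider and model,
# and many providers bill images and audio as tokens rather than per unit.
HYPOTHETICAL_PRICES = {
    "text_tokens": 0.000003,   # per input token
    "images": 0.004,           # per image
    "audio_seconds": 0.0001,   # per second of audio
    "video_seconds": 0.002,    # per second of video
}


@dataclass
class RequestUsage:
    """Input volume for a single request, split by modality."""
    text_tokens: int = 0
    images: int = 0
    audio_seconds: float = 0.0
    video_seconds: float = 0.0


def estimate_cost(usage, prices=HYPOTHETICAL_PRICES):
    """Return a per-modality cost breakdown plus a 'total' entry."""
    breakdown = {
        modality: getattr(usage, modality) * unit_price
        for modality, unit_price in prices.items()
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    # Example: a support ticket with a short prompt, two screenshots,
    # and a 30-second voice note.
    usage = RequestUsage(text_tokens=1200, images=2, audio_seconds=30.0)
    for modality, cost in estimate_cost(usage).items():
        print(f"{modality}: ${cost:.4f}")
```

Keeping the breakdown per modality, rather than returning only a total, makes it easier to see which input type dominates spend. A latency budget could be tracked the same way with a second table of measured per-modality timings.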

Related terms

Whisper
Voice agent
