Models
Multi-modal
Models that accept inputs beyond text: images, audio, video, and PDFs.
Many frontier models now accept images, audio, and (increasingly) video as inputs alongside text. In production this enables document understanding without a separate OCR step, visual QA, voice agents, and image-grounded chat. Each modality has its own pricing and latency profile, so estimate cost and response time per modality rather than assuming text-only figures.
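As a rough illustration of budgeting per modality, the sketch below adds up an estimated cost for a single request. The rates, the `RequestProfile` fields, and the `estimate_cost` helper are all hypothetical placeholders, not any provider's actual pricing or API; substitute the figures from your vendor's price sheet.

```python
from dataclasses import dataclass

# Hypothetical per-unit prices in USD. Placeholders only, not real vendor rates.
HYPOTHETICAL_RATES = {
    "text_tokens": 0.000003,   # per input token
    "images": 0.002,           # per image
    "audio_seconds": 0.0001,   # per second of audio
    "video_seconds": 0.001,    # per second of video
}


@dataclass
class RequestProfile:
    """How much of each modality a single request carries (hypothetical shape)."""
    text_tokens: int = 0
    images: int = 0
    audio_seconds: float = 0.0
    video_seconds: float = 0.0


def estimate_cost(profile, rates=HYPOTHETICAL_RATES):
    """Return a per-modality cost breakdown plus a 'total' entry."""
    breakdown = {
        name: getattr(profile, name) * rate for name, rate in rates.items()
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    # Example: a chat turn with 1,000 text tokens and two attached images.
    request = RequestProfile(text_tokens=1000, images=2)
    for modality, cost in estimate_cost(request).items():
        print(f"{modality}: ${cost:.4f}")
```

Keeping the breakdown per modality, rather than a single total, makes it easier to see which input type dominates spend. The same structure could hold latency estimates if you measure them for your workload.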