Multimodal AI
AI that can work with more than text — reading images, processing audio, interpreting video — which opens new capabilities and introduces risks that vary sharply by what it's perceiving.
Multimodal AI refers to systems that can process and generate more than one type of data (text, images, audio, video, documents, and sometimes sensor data) within the same model or workflow. Rather than relying on a separate specialized tool for each format, a multimodal system can analyze a chart and describe it in words, transcribe a meeting and extract action items, read a handwritten form, or respond to a spoken prompt. This expands what AI can automate, particularly in workflows where information arrives in mixed formats.
Each modality carries its own accuracy profile, privacy exposure, and compliance requirements. An AI that analyzes images can misidentify objects, products, or people. One that processes audio can misattribute statements or mishandle consent around recorded conversations. Systems that handle medical images, biometric data, or documents containing personal information are subject to regulatory requirements that text-only systems often aren't. Organizations evaluating multimodal AI should assess the risks of each input type separately — not just whether the overall capability is impressive, but whether its specific failure modes are acceptable for the workflow it's being applied to.
Related concepts
Generative AI
Generative AI produces new content—text, images, code, summaries, audio—on demand, based on patterns learned from vast amounts of existing data.
Foundation Models
Large, general-purpose AI models trained on vast data — the shared starting point that organizations adapt rather than build from scratch.
Text-to-Image Models
AI that generates images from text descriptions — genuinely useful for creative work, and carrying unsettled intellectual property and brand governance questions most organizations haven't resolved.