AI for Executives
Generative AI · Intermediate · Draft · pending human review

Multimodal AI

AI that can work with more than text — reading images, processing audio, interpreting video — which opens new capabilities and introduces risks that vary sharply by what it's perceiving.

Multimodal AI refers to systems that can process and generate more than one type of data (text, images, audio, video, documents, and sometimes sensor data) within the same model or workflow. Rather than relying on a separate specialized tool for each format, a multimodal system can analyze a chart and describe it in words, transcribe a meeting and extract action items, read a handwritten form, or respond to a spoken prompt. This expands what AI can automate, particularly in workflows where information arrives in mixed formats.

Each modality carries its own accuracy profile, privacy exposure, and compliance requirements. An AI that analyzes images can misidentify objects, products, or people. One that processes audio can misattribute statements or mishandle consent around recorded conversations. Systems that handle medical images, biometric data, or documents containing personal information are subject to regulatory requirements that text-only systems often aren't. Organizations evaluating multimodal AI should assess the risks of each input type separately: the question is not just whether the overall capability is impressive, but whether each modality's specific failure modes are acceptable for the workflow it is applied to.
