Generative AI · Intermediate · Draft · pending human review

Multimodal AI

AI that can work with more than text — reading images, processing audio, interpreting video — which opens new capabilities and introduces risks that vary sharply by what it's perceiving.

Multimodal AI refers to systems that can process and generate more than one type of data (text, images, audio, video, documents, and sometimes sensor data) within the same model or workflow. Rather than using separate specialized tools for each format, multimodal systems can analyze a chart and describe it in words, transcribe a meeting and extract action items, read a handwritten form, or respond to a spoken prompt. This expands what AI can automate, particularly in workflows where information arrives in mixed formats.

Each modality carries its own accuracy profile, privacy exposure, and compliance requirements. An AI that analyzes images can misidentify objects, products, or people. One that processes audio can misattribute statements, and recording conversations raises consent questions that must be handled correctly. Systems that handle medical images, biometric data, or documents containing personal information are subject to regulatory requirements that text-only systems often aren't. Organizations evaluating multimodal AI should assess the risks of each input type separately, asking not just whether the overall capability is impressive, but whether each modality's specific failure modes are acceptable for the workflow it is being applied to.
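The per-modality assessment described above can be sketched as a small checklist in code. This is a minimal illustration, not a prescribed method: the class name, field names, thresholds, and example figures are all hypothetical, and a real evaluation would draw its error rates from the organization's own test data and its compliance status from its own review process.

```python
from dataclasses import dataclass


@dataclass
class ModalityAssessment:
    """One input type in a workflow (all fields are illustrative assumptions)."""
    modality: str
    measured_error_rate: float    # hypothetical: from your own evaluation set
    acceptable_error_rate: float  # hypothetical: tolerance set by the workflow owner
    handles_personal_data: bool
    compliance_review_done: bool

    def blockers(self) -> list:
        """List the reasons this modality is not yet acceptable for the workflow."""
        issues = []
        if self.measured_error_rate > self.acceptable_error_rate:
            issues.append("error rate above workflow tolerance")
        if self.handles_personal_data and not self.compliance_review_done:
            issues.append("personal data without compliance review")
        return issues


def assess(assessments):
    """Map each modality to its own list of blockers (empty list means clear)."""
    return {a.modality: a.blockers() for a in assessments}


def approved(assessments) -> bool:
    """The workflow passes only if every modality is clear on its own."""
    return all(not issues for issues in assess(assessments).values())


# Made-up example figures: images pass, audio does not.
workflow = [
    ModalityAssessment("image", 0.02, 0.05, False, False),
    ModalityAssessment("audio", 0.08, 0.05, True, False),
]
report = assess(workflow)
```

The design choice worth noting is that approval is the conjunction of per-modality checks rather than an average: a strong result on images cannot offset an unacceptable failure mode in audio.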

Related concepts

Generative AI

Generative AI produces new content—text, images, code, summaries, audio—on demand, based on patterns learned from vast amounts of existing data.

Generative AI

Foundation Models

Large, general-purpose AI models trained on vast data — the shared starting point that organizations adapt rather than build from scratch.

Generative AI

Text-to-Image Models

AI that generates images from text descriptions — genuinely useful for creative work, and carrying unsettled intellectual property and brand governance questions most organizations haven't resolved.

In these paths

Self-Directed