Technical Concepts · Foundational · Draft · pending human review

Latency

How fast an AI system responds — and why it determines whether a model that works in theory is usable in practice.

Latency is the time between sending a request to an AI system and receiving a response. It's influenced by model size, input length, infrastructure location, concurrent load, and network conditions. For batch processing or overnight jobs, latency rarely matters. For real-time applications (fraud scoring at checkout, live customer support, voice interfaces) it's a hard constraint. A model that takes three seconds to return a suggestion inside a text editor is functionally unusable, regardless of how accurate its suggestions are.
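As a rough illustration of that definition, the sketch below times a single request from send to response. It is a minimal example under stated assumptions: `timed_call` and `fake_model_call` are hypothetical names, and the 50 ms sleep stands in for a real model request.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed seconds)."""
    # perf_counter is a monotonic, high-resolution clock suited to intervals.
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_model_call() -> str:
    # Hypothetical stand-in for a real model request; sleeps to simulate work.
    time.sleep(0.05)
    return "ok"


result, seconds = timed_call(fake_model_call)
print(f"response={result!r} latency={seconds * 1000:.0f} ms")
```

Timing on the caller's side, as here, captures network and queueing delay as well as model compute, which is the latency a user actually experiences.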

Latency requirements should define model selection, not the other way around. Teams that choose the most capable model first and discover the latency problem only after deployment face an expensive rebuild. Latency also varies in production in ways that testing under low load won't reveal: the 99th-percentile response time under peak traffic is what determines whether a customer-facing feature is reliable or frustrating. Average latency conceals the tail experience that drives abandonment and complaints.
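To see how an average can hide the tail, here is a small sketch that compares the mean with the 99th percentile. The latency values are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank method (other percentile definitions interpolate and can give slightly different values).

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    # pct is expected to be an integer between 1 and 100.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative, made-up latencies in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [200] * 98 + [4000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

With this sample the mean is 276 ms and the median is 200 ms, yet the 99th percentile is 4000 ms: two requests in a hundred wait four seconds, and the average gives no hint of it.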
