Latency
How fast an AI system responds — and why it determines whether a model that works in theory is usable in practice.
Latency is the time between sending a request to an AI system and receiving a response. It's influenced by model size, input length, infrastructure location, concurrent load, and network conditions. For batch processing or overnight jobs, latency rarely matters. For real-time applications — fraud scoring at checkout, live customer support, voice interfaces — it's a hard constraint. A model that takes three seconds to respond to a text-editor prompt is functionally unusable, regardless of how accurate its suggestions are.
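As a rough illustration of that definition, the sketch below times a single request from send to response using a wall-clock timer. `fake_model` is a hypothetical stand-in for a real model endpoint, and the 50 ms sleep is an invented delay, not a measurement of any actual system.

```python
import time


def measure_latency_ms(call, *args, **kwargs):
    """Run one call and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = call(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_model(prompt):
    # Hypothetical stand-in for a model endpoint; the sleep simulates
    # inference plus network time.
    time.sleep(0.05)
    return prompt.upper()


result, elapsed_ms = measure_latency_ms(fake_model, "hello")
print(f"response={result!r} latency={elapsed_ms:.1f} ms")
```

Measured from the client side like this, the number includes network time and queuing as well as the model's own compute, which is the figure a user actually experiences.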
Latency requirements should define model selection, not the other way around. Teams that choose the most capable model and discover the latency problem post-deployment face an expensive rebuild. Latency also varies in production in ways that testing under low load won't reveal: the 99th-percentile response time under peak traffic is what determines whether a customer-facing feature is reliable or frustrating. Average latency conceals the tail experience that drives abandonment and complaints.
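To make the gap between average and tail latency concrete, here is a minimal sketch using a nearest-rank percentile. The sample data is invented for illustration: 980 requests at 120 ms and 20 slow requests at 4,000 ms, roughly what a service might look like when a small share of traffic hits a queue or a cold start.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds (not real measurements).
latencies_ms = [120.0] * 980 + [4000.0] * 20

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.0f} ms  p99={p99_ms:.0f} ms")
```

In this made-up sample the mean stays under 200 ms while the 99th percentile is 4,000 ms, so a dashboard that tracks only the average would miss the slow requests entirely.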
Read next
Related concepts
Inference
Where training ends and the model starts doing actual work — producing outputs on real inputs, in real time.
Large Language Models (Generative AI)
The AI models behind most generative tools today — capable of remarkable language tasks, and unreliable about facts they were never trained on.
Cloud AI (Technical Concepts)
AI capabilities delivered as a service — powerful and accessible, with vendor dependency and data governance strings attached.