Inference
Where training ends and the model starts doing actual work — producing outputs on real inputs, in real time.
Inference is the stage where a trained model is applied to new inputs to produce outputs. It's what happens when a model scores a loan application, classifies a support ticket, generates a summary, or responds to a user prompt. Training happens once (or periodically) to create the model; inference happens continuously, every time someone or something uses it. The computational demands, costs, and performance characteristics of inference differ from those of training, which is why a model that works in development may behave differently at production scale.
Inference is where AI investment becomes operational cost and risk. Cost per request, response latency, reliability under load, and failure behavior all become real constraints that never surfaced in a demo environment. A model that performs well in a development notebook, tested by one engineer with hand-crafted inputs, can behave very differently when it's handling thousands of concurrent users with messy real-world data. Production inference requires cost modeling, latency budgets, fallback behavior, and monitoring, not just model accuracy.
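To make the production concerns above concrete, here is a minimal Python sketch of a latency budget with fallback behavior, plus a back-of-the-envelope cost estimate. Everything in it is an illustrative assumption rather than a prescribed implementation: `infer_with_budget`, `monthly_cost`, and `InferenceResult` are hypothetical names, `model_fn` and `fallback_fn` stand in for whatever model call and fallback a real system would use, and the cost formula assumes a flat price per request.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Shared worker pool so that a slow model call does not block the caller.
_POOL = ThreadPoolExecutor(max_workers=4)


@dataclass
class InferenceResult:
    output: str
    latency_s: float      # measured wall-clock time, useful for monitoring
    used_fallback: bool   # True when the model missed the budget or failed


def infer_with_budget(model_fn, fallback_fn, request, budget_s):
    """Call model_fn(request); use fallback_fn if it exceeds budget_s or raises."""
    start = time.perf_counter()
    future = _POOL.submit(model_fn, request)
    try:
        output = future.result(timeout=budget_s)
        used_fallback = False
    except Exception:
        # Covers both a timeout and an error raised by the model call.
        # cancel() is best effort: a call that is already running is not interrupted.
        future.cancel()
        output = fallback_fn(request)
        used_fallback = True
    return InferenceResult(output, time.perf_counter() - start, used_fallback)


def monthly_cost(requests_per_day, cost_per_request, days=30):
    """Rough cost model: assumes a flat per-request price (hypothetical)."""
    return requests_per_day * cost_per_request * days
```

A caller might log `latency_s` and `used_fallback` for every request, which gives monitoring a direct view of how often the model misses its budget. Returning a fallback instead of an error is one design choice among several; some systems may prefer to fail loudly.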
Read next
Related concepts
Model
The learned component at the core of an AI system — what turns inputs into predictions, decisions, or generated content.
Technical Concepts
Latency
How fast an AI system responds — and why it determines whether a model that works in theory is usable in practice.
Operations and Deployment
Model Deployment
The step where a trained model stops being a proof of concept and starts affecting real decisions — and where most AI projects either succeed or quietly fail.