Technical Concepts · Intermediate · Draft · pending human review

Inference

Where training ends and the model starts doing actual work — producing outputs on real inputs, in real time.

Inference is the stage where a trained model is applied to new inputs to produce outputs. It's what happens when a model scores a loan application, classifies a support ticket, generates a summary, or responds to a user prompt. Training happens once (or periodically) to create the model; inference happens continuously, every time someone or something uses it. Inference also has different computational demands, costs, and performance characteristics from training, which is one reason a model that works in development may behave differently at production scale.
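The split between the two stages can be made concrete with a toy example. The sketch below is only an illustration, not how any particular system is built: `train` and `infer` are hypothetical names, and the "model" is a straight line fitted by ordinary least squares. The point is the shape of the workflow: fitting runs once, and the resulting parameters are then reused for every new input.

```python
from statistics import mean

def train(xs, ys):
    """Training: fit y = a*x + b by ordinary least squares. Runs once, offline."""
    x_bar, y_bar = mean(xs), mean(ys)
    a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b = y_bar - a * x_bar
    return a, b  # the "trained model" is just these two numbers

def infer(model, x):
    """Inference: apply the already-fitted parameters to one new input."""
    a, b = model
    return a * x + b

# Train once on historical data (toy values, chosen for illustration).
model = train([1, 2, 3, 4], [2, 4, 6, 8])

# Infer many times, once per incoming request.
predictions = [infer(model, x) for x in [5, 10]]
```

Here `train` touches the whole dataset, while `infer` does a fixed, small amount of work per call. Real models differ enormously in scale, but the same asymmetry is why the two stages are budgeted and monitored separately.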

Inference is where AI investment becomes operational cost and risk. Cost per request, response latency, reliability under load, and failure behavior all become real constraints that didn't exist in a demo environment. A model that performs well in a development notebook, tested by one engineer with hand-crafted inputs, can behave very differently when it's handling thousands of concurrent users with messy real-world data. Production inference therefore requires cost modeling, latency budgets, fallback behavior, and monitoring, not just model accuracy.
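Two of these constraints, the latency budget with fallback and the cost model, can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than a production pattern: `infer_with_budget` and `estimate_monthly_cost` are hypothetical helpers, the 30-day month and the flat per-request price are simplifications, and the thread pool only stops waiting for a slow call (it does not interrupt work that is already running).

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool for model calls; the size of 4 is an arbitrary assumption.
_pool = ThreadPoolExecutor(max_workers=4)

def infer_with_budget(model_fn, fallback_fn, x, budget_s):
    """Call model_fn(x); if it exceeds budget_s seconds or raises, use fallback_fn(x).

    Returns (output, source) so monitoring can count how often the fallback fires.
    """
    future = _pool.submit(model_fn, x)
    try:
        return future.result(timeout=budget_s), "model"
    except FutureTimeout:
        future.cancel()  # best effort: has no effect if the call is already running
        return fallback_fn(x), "fallback:timeout"
    except Exception:
        return fallback_fn(x), "fallback:error"

def estimate_monthly_cost(requests_per_day, cost_per_request_usd, days=30):
    """Back-of-envelope cost model: volume times a flat per-request price."""
    return requests_per_day * days * cost_per_request_usd

# Hypothetical models standing in for a real inference call.
def fast_model(x):
    return x * 2

def slow_model(x):
    time.sleep(0.2)  # simulates a call that blows the latency budget
    return x * 2

def default_answer(x):
    return None  # e.g. a cached or rule-based response

print(infer_with_budget(fast_model, default_answer, 3, budget_s=1.0))
print(infer_with_budget(slow_model, default_answer, 3, budget_s=0.05))
print(estimate_monthly_cost(requests_per_day=10_000, cost_per_request_usd=0.001))
```

Returning the source label alongside the output is a deliberate choice: the fallback rate is itself a monitoring signal, since a rising share of `fallback:timeout` results suggests the service is degrading under load even when every request still gets an answer.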
