Model Evaluation
How teams determine whether a model actually works — and the reason 'it works in testing' is often the most dangerous thing anyone says before launch.
Model evaluation is the process of systematically testing a model's performance before — and after — it goes into production. It asks: does the model do what it's supposed to do, on the data it will actually see, across the populations it will actually affect? Evaluation goes beyond accuracy metrics: it includes whether the model is reliable on edge cases, whether performance holds across different demographic groups, whether outputs are calibrated (not just correct on average), and whether improvements in technical metrics actually translate to better business outcomes.
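To make the calibration point concrete, here is a minimal sketch in plain Python. It assumes a binary classifier that outputs a probability for the positive class; the function name `expected_calibration_error`, the bin count, and the toy scores are illustrative assumptions, not part of any particular library. It compares two hypothetical models that have the same accuracy but differ in how far their stated confidence can be trusted.

```python
def expected_calibration_error(probs, labels, n_bins=5):
    """Weighted average gap between mean predicted probability and
    observed positive rate, computed over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - pos_rate)
    return ece


# Toy data (hypothetical): 10 cases, 6 of which are truly positive.
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# Both models predict "positive" for every case (score > 0.5),
# so both are right on 6 of 10 cases.
overconfident = [0.9] * 10  # claims 90% confidence, is right 60% of the time
calibrated = [0.6] * 10     # claims 60% confidence, is right 60% of the time

ece_overconfident = expected_calibration_error(overconfident, labels)
ece_calibrated = expected_calibration_error(calibrated, labels)
```

In this toy case the accuracy metric cannot tell the two models apart, while the calibration gap is roughly 0.3 for the first and roughly 0 for the second. A downstream process that acts on the score itself (for example, escalating anything above 0.8) would behave very differently with each.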
Most model failures that cause real problems weren't surprises — they were visible in the evaluation data and either missed or dismissed. A model evaluated only on the clean, representative data used in development will look much better than it performs on live traffic. A model evaluated only on aggregate accuracy will look better than it performs for the subgroups that matter for fairness or regulatory compliance. Weak evaluation is how organizations end up explaining to regulators, press, or customers why a system that "performed well in testing" produced discriminatory or incorrect results in the real world. Evaluation is a governance function, not just a technical one.
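The gap between aggregate and subgroup performance can be shown with a small sketch. It assumes each evaluation record carries a group label alongside the prediction and the true outcome; the helper names and the toy numbers are hypothetical and chosen only to illustrate the effect.

```python
from collections import defaultdict


def overall_accuracy(records):
    """records: list of (group, prediction, label) tuples."""
    return sum(int(pred == label) for _, pred, label in records) / len(records)


def accuracy_by_group(records):
    """Accuracy computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, pred, label in records:
        total[group] += 1
        correct[group] += int(pred == label)
    return {group: correct[group] / total[group] for group in total}


# Toy evaluation set (hypothetical): group A dominates the data.
records = (
    [("A", 1, 1)] * 88    # group A, correct
    + [("A", 1, 0)] * 2   # group A, wrong
    + [("B", 1, 1)] * 5   # group B, correct
    + [("B", 0, 1)] * 5   # group B, wrong
)

overall = overall_accuracy(records)
per_group = accuracy_by_group(records)
```

Here the aggregate figure is 93%, yet the model is right for group B only half the time. Because group B makes up a tenth of the evaluation set, its failures barely move the headline number, which is why a per-group breakdown is worth reporting alongside it.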
Read next
Related concepts
Model
The learned component at the core of an AI system — what turns inputs into predictions, decisions, or generated content.
Accuracy
The most widely reported AI performance metric — and one of the easiest to be misled by.
Precision and Recall
The two metrics that capture how a model fails — flagging too many false alarms versus missing too many real cases — and why choosing between them is a business decision, not a technical one.