Operations and Deployment · Intermediate · Draft · pending human review

Model Evaluation

How teams determine whether a model actually works — and the reason 'it works in testing' is often the most dangerous thing anyone says before launch.

Model evaluation is the process of systematically testing a model's performance before — and after — it goes into production. It asks: does the model do what it's supposed to do, on the data it will actually see, across the populations it will actually affect? Evaluation goes beyond accuracy metrics: it includes whether the model is reliable on edge cases, whether performance holds across different demographic groups, whether outputs are calibrated (not just correct on average), and whether improvements in technical metrics actually translate to better business outcomes.
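To make "calibrated, not just correct on average" concrete, here is a minimal sketch of one common calibration check, a binned expected calibration error for a binary classifier. The function name, the choice of ten equal-width bins, and the toy data are assumptions made for illustration, not part of any specific team's evaluation suite.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned calibration gap for predicted positive-class probabilities.

    probs:  predicted probability of the positive class, each in [0, 1]
    labels: true labels, each 0 or 1
    Returns the size-weighted average of |mean predicted probability -
    observed positive rate| over equal-width probability bins.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_prob - positive_rate)
    return ece


# Hypothetical toy data: the model says 0.9 or 0.1, but is right only
# 75% of the time in each case, so it is overconfident.
probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
ece = expected_calibration_error(probs, labels)
```

In this toy case the model's accuracy at a 0.5 threshold is 75%, yet its stated confidence is 90%; the calibration gap captures that mismatch, which an accuracy figure alone would not show.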

Most model failures that cause real problems weren't surprises — they were visible in the evaluation data and either missed or dismissed. A model evaluated only on the clean, representative data used in development will look much better than it performs on live traffic. A model evaluated only on aggregate accuracy will look better than it performs for the subgroups that matter for fairness or regulatory compliance. Weak evaluation is how organizations end up explaining to regulators, press, or customers why a system that "performed well in testing" produced discriminatory or incorrect results in the real world. Evaluation is a governance function, not just a technical one.
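The gap between aggregate and subgroup performance can be sketched in a few lines. The example below is illustrative only: the function name, the group labels "A" and "B", and the toy predictions are hypothetical, and a real evaluation would use whatever groupings matter for the system's fairness or regulatory context.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall accuracy, {group: accuracy}) for parallel lists."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)

    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group


# Hypothetical toy data: group A dominates the evaluation set.
groups = ["A"] * 8 + ["B"] * 2
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
```

Here the aggregate accuracy is 90%, while group B, which makes up only a fifth of the data, is at 50%. Reporting the per-group breakdown alongside the headline number is what makes this kind of failure visible before launch rather than after.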

