The Eval Gap
Why your AI demo works and production doesn't — and how to close the distance.
Definition#
The eval gap is the difference between how an AI system performs in a curated demo and how it performs in production. It is why a prototype that looks flawless in development fails in ways you never tested once real users arrive.
Four Mechanisms That Create the Gap#
Curated Inputs
Demos use carefully chosen examples. Production receives whatever users send — typos, edge cases, adversarial inputs, languages you didn't test.
Controlled Environment
Development has fast APIs, no rate limits, and the latest model. Production brings latency spikes, throttling, and model versions you don't control.
Small Sample Size
10 demo runs look great. 10,000 production runs reveal the long tail of failures that sampling never catches.
Implicit Oversight
During demos, humans review every output. In production, agents run autonomously. The human safety net disappears.
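The small-sample mechanism above is easy to quantify. If a failure mode occurs on a fraction p of runs, the chance of observing it at least once in n independent runs is 1 − (1 − p)^n. A short sketch (the 0.1% failure rate is illustrative, not a figure from this article):

```python
def p_observe(p: float, n: int) -> float:
    """Probability of seeing at least one failure with per-run rate p in n runs."""
    return 1 - (1 - p) ** n

# A 0.1% failure mode is nearly invisible in 10 demo runs
# but near-certain to surface across 10,000 production runs.
for n in (10, 10_000):
    print(f"{n:>6} runs: {p_observe(0.001, n):.1%} chance of observing the failure")
```

This is why 10 clean demo runs tell you almost nothing about the long tail.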
Closing the Gap#
The eval gap closes when you run the same evaluation in production that you run in development, on every output, not just test cases. Inline eval makes production performance visible in real time. The gap between demo and reality becomes measurable, and measurable problems are solvable.
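The pattern can be sketched in a few lines: define one eval rule, run it over curated cases in development, then run the identical rule inline on every production output. Everything here (rule names, thresholds, the `handle_request` wrapper) is a hypothetical illustration, not Iris's actual API:

```python
def eval_output(output: str) -> dict:
    """One set of eval rules, shared verbatim by dev and production."""
    return {
        "non_empty": bool(output.strip()),
        "under_length_limit": len(output) <= 2000,  # illustrative threshold
        "no_refusal": "I cannot help" not in output,
    }

def passes(scores: dict) -> bool:
    return all(scores.values())

# Development: score curated test cases with the rule.
assert passes(eval_output("The capital of France is Paris."))

# Production: the same rule scores every live output inline.
def handle_request(generate, prompt: str):
    output = generate(prompt)
    scores = eval_output(output)  # same rules, same thresholds as in dev
    record = {"prompt": prompt, "scores": scores, "pass": passes(scores)}
    return output, record
```

Because both paths call the same function, there is nothing for demo and production behavior to drift apart on; any quality drop shows up in the per-request `record` immediately.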
How Iris Helps#
Iris evaluates every production output with the same rules you define in development. There is no gap between test-time scoring and production scoring: same rules, same thresholds, every execution. The dashboard shows production quality in real time, making the eval gap visible and quantifiable.
Read the deep dive: The Eval Gap →
Related Concepts#
The Eval Tax
The eval gap is where the eval tax accumulates fastest.
Eval-Driven Development
Define eval rules before prompts — closes the gap by design.
Eval Coverage
100% coverage means no output goes unscored — the gap becomes visible.
Agent Eval
The complete guide to evaluating AI agent outputs.