
The Eval Gap

Why your AI demo works and production doesn't — and how to close the distance.

Definition#

The eval gap is the difference between how an AI agent performs in demos and development versus how it performs in production. Curated inputs, controlled environments, small sample sizes, and implicit human oversight during development create an illusion of reliability that evaporates under real-world conditions.

Four Mechanisms That Create the Gap#

Curated Inputs

Demos use carefully chosen examples. Production receives whatever users send — typos, edge cases, adversarial inputs, languages you didn't test.

Controlled Environment

Development has fast APIs, unlimited rate limits, and the latest model. Production has latency spikes, rate limiting, and model versions you don't control.

Small Sample Size

10 demo runs look great. 10,000 production runs reveal the long tail of failures that sampling never catches.

Implicit Oversight

During demos, humans review every output. In production, agents run autonomously. The human safety net disappears.
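The sample-size point above can be made concrete with a little arithmetic. A minimal sketch (the failure rate and run counts are illustrative, not measured figures): a failure mode that fires on 1% of requests will usually never appear in a 10-run demo, yet produces roughly a hundred failures across 10,000 production runs.

```python
def p_any_failure(p: float, n: int) -> float:
    """Probability of seeing at least one failure in n independent
    runs, given a per-run failure rate p: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

# A 1% failure mode is nearly invisible at demo scale...
demo = p_any_failure(0.01, 10)        # ~0.10 — 90% odds the demo looks clean
# ...but near-certain at production scale.
prod = p_any_failure(0.01, 10_000)    # ~1.0
expected_failures = 0.01 * 10_000     # ~100 failed runs in the long tail
```

This is why passing a handful of curated runs says little about the long tail: the demo is sampling far too shallowly to observe it.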

Closing the Gap#

The eval gap closes when you run the same evaluation in production that you run in development — on every output, not just test cases. Inline eval makes production performance visible in real time. The gap between demo and reality becomes measurable, and measurable problems are solvable.
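The "same evaluation in both places" idea can be sketched in a few lines. This is a hypothetical illustration, not the Iris API: the rule name, `evaluate`, and `handle_request` are all invented for the example. The point is that one rule definition is shared by the development test and the production path, so scoring can never drift between the two.

```python
from typing import Callable

EvalRule = Callable[[str], bool]

def check_nonempty_answer(output: str) -> bool:
    """Illustrative rule: the agent must produce a non-trivial answer."""
    return len(output.strip()) >= 10

def evaluate(output: str, rules: list[EvalRule]) -> dict[str, bool]:
    # Same rules, same thresholds, whether called from a test or production.
    return {rule.__name__: rule(output) for rule in rules}

RULES: list[EvalRule] = [check_nonempty_answer]

# Development: score curated test cases.
assert all(evaluate("The capital of France is Paris.", RULES).values())

failure_log: list[dict[str, bool]] = []

# Production: the identical evaluation wraps every agent output.
def handle_request(agent_output: str) -> str:
    scores = evaluate(agent_output, RULES)
    if not all(scores.values()):
        failure_log.append(scores)  # make the gap visible instead of silent
    return agent_output
```

Because production failures are recorded rather than passing silently, the gap between test-time and real-world quality becomes a number you can watch.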

How Iris Helps#

Iris evaluates every production output with the same rules you define in development. No gap between test-time scoring and production scoring — same rules, same thresholds, every execution. The dashboard shows you production quality in real time, making the eval gap visible and quantifiable.

Read the deep dive: The Eval Gap →

Frequently Asked Questions#