
The Eval Gap

Why your AI demo works and production doesn't — and how to close the distance.

Definition#

The eval gap is the difference between how an AI agent performs in demos and development versus how it performs in production. Curated inputs, controlled environments, small sample sizes, and implicit human oversight during development create an illusion of reliability that evaporates under real-world conditions.

Four Mechanisms That Create the Gap#

Curated Inputs

Demos use carefully chosen examples. Production receives whatever users send — typos, edge cases, adversarial inputs, languages you didn't test.

Controlled Environment

Development has fast APIs, unlimited rate limits, and the latest model. Production has latency spikes, rate limiting, and model versions you don't control.

Small Sample Size

10 demo runs look great. 10,000 production runs reveal the long tail of failures that sampling never catches.

Implicit Oversight

During demos, humans review every output. In production, agents run autonomously. The human safety net disappears.
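The sample-size point above can be made concrete with a little arithmetic. A minimal sketch (the failure rate and run counts are illustrative, not measured figures): a failure mode that fires on 1% of requests will usually never appear in a 10-run demo, yet produces roughly a hundred failures across 10,000 production runs.

```python
def p_any_failure(p: float, n: int) -> float:
    """Probability of seeing at least one failure in n independent
    runs, given a per-run failure rate p: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

# A 1% failure mode is nearly invisible at demo scale...
demo = p_any_failure(0.01, 10)        # ~0.10 — 90% odds the demo looks clean
# ...but near-certain at production scale.
prod = p_any_failure(0.01, 10_000)    # ~1.0
expected_failures = 0.01 * 10_000     # ~100 failed runs in the long tail
```

This is why passing a handful of curated runs says little about the long tail: the demo is sampling far too shallowly to observe it.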

Closing the Gap#

The eval gap closes when you run the same evaluation in production that you run in development — on every output, not just test cases. Inline eval makes production performance visible in real time. The gap between demo and reality becomes measurable, and measurable problems are solvable.
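The "same evaluation in both places" idea can be sketched in a few lines. This is a hypothetical illustration, not the Iris API: the rule name, `evaluate`, and `handle_request` are all invented for the example. The point is that one rule definition is shared by the development test and the production path, so scoring can never drift between the two.

```python
from typing import Callable

EvalRule = Callable[[str], bool]

def check_nonempty_answer(output: str) -> bool:
    """Illustrative rule: the agent must produce a non-trivial answer."""
    return len(output.strip()) >= 10

def evaluate(output: str, rules: list[EvalRule]) -> dict[str, bool]:
    # Same rules, same thresholds, whether called from a test or production.
    return {rule.__name__: rule(output) for rule in rules}

RULES: list[EvalRule] = [check_nonempty_answer]

# Development: score curated test cases.
assert all(evaluate("The capital of France is Paris.", RULES).values())

failure_log: list[dict[str, bool]] = []

# Production: the identical evaluation wraps every agent output.
def handle_request(agent_output: str) -> str:
    scores = evaluate(agent_output, RULES)
    if not all(scores.values()):
        failure_log.append(scores)  # make the gap visible instead of silent
    return agent_output
```

Because production failures are recorded rather than passing silently, the gap between test-time and real-world quality becomes a number you can watch.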

How Iris Helps#

Iris evaluates every production output with the same rules you define in development. No gap between test-time scoring and production scoring — same rules, same thresholds, every execution. The dashboard shows you production quality in real time, making the eval gap visible and quantifiable.

Read the deep dive: The Eval Gap →

Frequently Asked Questions#