Iris v0.1 — The agent eval standard for MCP. 12 eval rules, open source

Eval Drift

The silent quality killer — when agent output degrades without any code changes.

Definition#

Eval drift is the silent degradation of agent output quality over time, even when your code and prompts haven't changed. Upstream model updates, shifting input distributions, and environmental changes alter agent behavior without any visible change in your codebase. Without continuous evaluation, drift goes undetected until users report failures.

What Causes Eval Drift#

Model Updates

Providers update models without notice. GPT-4 today is not GPT-4 from last month. Same API, different behavior.

Input Shift

Real-world usage patterns differ from development. Edge cases accumulate. The distribution your prompts were tuned for drifts.

Environment Changes

API rate limits, context window changes, dependency updates, and infrastructure shifts all affect behavior indirectly.

Detection#

Eval drift is invisible without continuous scoring. The only way to detect it is to run the same evaluation rules on every output and track scores over time. When average quality scores trend downward without code changes, that's drift. A single failing output might be noise — a downward trend over days or weeks is signal.
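The trend-versus-noise distinction above can be sketched as a simple comparison of a recent scoring window against an earlier baseline window. This is a minimal illustration, not Iris's actual detection logic; the window sizes and tolerance are assumed values.

```python
from statistics import mean

def detect_drift(daily_scores, baseline_days=7, recent_days=7, tolerance=0.05):
    """Flag drift when the recent average score falls below the
    baseline average by more than `tolerance`.

    A single low score is noise; a sustained gap between windows
    is signal. Window sizes and tolerance are illustrative.
    """
    if len(daily_scores) < baseline_days + recent_days:
        return False  # not enough history to judge a trend
    baseline = mean(daily_scores[:baseline_days])
    recent = mean(daily_scores[-recent_days:])
    return (baseline - recent) > tolerance

# Scores trending down with no code change: classic drift.
scores = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.91,   # baseline week
          0.88, 0.86, 0.85, 0.83, 0.82, 0.80, 0.79]   # recent week
print(detect_drift(scores))  # → True
```

Averaging over windows rather than reacting to individual outputs is what separates a real trend from day-to-day variance.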

Key Data

Eval drift is the #1 reason agents that work in demos fail in production. The demo environment is frozen. Production is not.

How Iris Helps#

Iris scores every output inline, making drift visible as a trend in the dashboard. When scores drop over time, you see it immediately — not weeks later from user complaints. The "All Time" period view lets you compare current performance against historical baselines.
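The baseline comparison that a historical period view enables can be sketched as follows. This is a hypothetical illustration of the idea, not the Iris API; the period names, score values, and 0.05 threshold are all assumptions.

```python
from statistics import mean

def period_report(scores_by_period, threshold=0.05):
    """Compare each period's average score against the all-time
    baseline and flag periods that fall meaningfully below it.

    `scores_by_period` maps a period label to its raw scores;
    the threshold is an illustrative choice, not an Iris default.
    """
    all_scores = [s for scores in scores_by_period.values() for s in scores]
    baseline = mean(all_scores)
    return {
        period: {
            "avg": round(mean(scores), 3),
            "drifting": (baseline - mean(scores)) > threshold,
        }
        for period, scores in scores_by_period.items()
    }

report = period_report({
    "march": [0.91, 0.90, 0.92],
    "april": [0.89, 0.90, 0.91],
    "may":   [0.82, 0.80, 0.79],
})
print(report["may"]["drifting"])  # → True
```

Comparing each period to an all-time baseline, rather than only to the previous period, catches slow drift that never produces a sharp drop between adjacent periods.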

Read the deep dive: Eval Drift →

Frequently Asked Questions#