Eval Drift
The silent quality killer — when agent output degrades without any code changes.
Definition#
Eval drift is the gradual degradation of agent output quality over time without any changes to your code or prompts. It is driven by factors outside your repository: model updates, shifting input distributions, and environment changes.
What Causes Eval Drift#
Model Updates
Providers update models without notice. GPT-4 today is not GPT-4 from last month. Same API, different behavior.
Input Shift
Real-world usage patterns differ from development. Edge cases accumulate. The distribution your prompts were tuned for drifts.
Environment Changes
API rate limits, context window changes, dependency updates, and infrastructure shifts that affect behavior indirectly.
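One way to catch the first cause, silent model updates, is to re-run a fixed "canary" prompt set and compare its mean score against a recorded baseline. This is a minimal sketch, not an Iris API: the `score` callable (your eval rules applied to one prompt's output) and the canary prompts are placeholders you would supply.

```python
from statistics import mean
from typing import Callable, Iterable

def canary_check(
    score: Callable[[str], float],   # eval rules applied to one prompt's output, 0.0-1.0
    prompts: Iterable[str],          # fixed canary set; never change it, or the baseline is meaningless
    baseline_mean: float,            # mean score recorded when the canaries were first run
    tolerance: float = 0.05,         # allowed drop before flagging a possible model update
) -> bool:
    """Return True if the canary set still scores within tolerance of baseline."""
    current = mean(score(p) for p in prompts)
    return (baseline_mean - current) <= tolerance
```

Because the prompt set is frozen, a score drop here isolates model behavior from input shift: your inputs did not change, so the model did.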
Detection#
Eval drift is invisible without continuous scoring. The only way to detect it is to run the same evaluation rules on every output and track scores over time. When average quality scores trend downward without code changes, that's drift. A single failing output might be noise — a downward trend over days or weeks is signal.
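The trend-versus-noise distinction above can be sketched as a rolling-window comparison. This is an illustrative implementation under simple assumptions (a flat score history, a fixed window size), not the method Iris uses internally.

```python
from statistics import mean

def is_drifting(scores: list[float], window: int = 50, drop: float = 0.05) -> bool:
    """Flag drift when the mean of the most recent window falls meaningfully
    below the mean of all prior scores. A single low score is noise; a
    sustained lower window is signal."""
    if len(scores) < 2 * window:
        return False  # not enough history to separate a trend from noise
    baseline = mean(scores[:-window])
    recent = mean(scores[-window:])
    return (baseline - recent) > drop
```

In practice you would tune `window` and `drop` to your output volume: a high-traffic agent can use a shorter window because each window still contains enough samples to average out noise.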
How Iris Helps#
Iris scores every output inline, making drift visible as a trend in the dashboard. When scores drop over time, you see it immediately — not weeks later from user complaints. The "All Time" period view lets you compare current performance against historical baselines.
Read the deep dive: Eval Drift →
Related Concepts#
Self-Calibrating Eval
The pattern that solves drift — eval rules that adapt thresholds based on observed distributions.
The Eval Loop
The continuous cycle that catches drift: score, diagnose, calibrate, re-score.
The Eval Tax
Drift compounds the eval tax — undetected quality degradation increases costs over time.
Agent Eval
The complete guide to evaluating AI agent outputs.
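As an illustration of the self-calibrating pattern mentioned under Related Concepts, the sketch below derives a pass/fail threshold from the observed score distribution instead of a constant chosen at development time. The function name and percentile choice are assumptions for illustration, not part of Iris.

```python
def calibrated_threshold(scores: list[float], percentile: float = 10.0) -> float:
    """Set the failure cutoff at a low percentile of recently observed scores,
    so the threshold tracks the real distribution as it drifts rather than
    staying pinned to a value tuned against development data."""
    ranked = sorted(scores)
    idx = int(len(ranked) * percentile / 100)
    return ranked[min(idx, len(ranked) - 1)]
```

Recomputing this on a recent window each day keeps the eval rule aligned with what the agent actually produces, which is the "calibrate" step of the eval loop.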