Eval Drift
The silent quality killer — when agent output degrades without any code changes.
Definition#
Eval drift is the gradual degradation of agent output quality over time without any changes to your code or prompts. It is driven by factors outside your repository: model updates, shifting input distributions, and environment changes.
What Causes Eval Drift#
Model Updates
Providers update models without notice. GPT-4 today is not GPT-4 from last month. Same API, different behavior.
Input Shift
Real-world usage patterns differ from development. Edge cases accumulate. The distribution your prompts were tuned for drifts.
Environment Changes
API rate limits, context window changes, dependency updates, and infrastructure shifts that affect behavior indirectly.
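One way to catch the first cause, silent model updates, is to re-run a fixed "canary" prompt set and compare its mean score against a recorded baseline. This is a minimal sketch, not an Iris API: the `score` callable (your eval rules applied to one prompt's output) and the canary prompts are placeholders you would supply.

```python
from statistics import mean
from typing import Callable, Iterable

def canary_check(
    score: Callable[[str], float],   # eval rules applied to one prompt's output, 0.0-1.0
    prompts: Iterable[str],          # fixed canary set; never change it, or the baseline is meaningless
    baseline_mean: float,            # mean score recorded when the canaries were first run
    tolerance: float = 0.05,         # allowed drop before flagging a possible model update
) -> bool:
    """Return True if the canary set still scores within tolerance of baseline."""
    current = mean(score(p) for p in prompts)
    return (baseline_mean - current) <= tolerance
```

Because the prompt set is frozen, a score drop here isolates model behavior from input shift: your inputs did not change, so the model did.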
Detection#
Eval drift is invisible without continuous scoring. The only way to detect it is to run the same evaluation rules on every output and track scores over time. When average quality scores trend downward without code changes, that's drift. A single failing output might be noise — a downward trend over days or weeks is signal.
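The trend-versus-noise distinction above can be sketched as a rolling-window comparison. This is an illustrative implementation under simple assumptions (a flat score history, a fixed window size), not the method Iris uses internally.

```python
from statistics import mean

def is_drifting(scores: list[float], window: int = 50, drop: float = 0.05) -> bool:
    """Flag drift when the mean of the most recent window falls meaningfully
    below the mean of all prior scores. A single low score is noise; a
    sustained lower window is signal."""
    if len(scores) < 2 * window:
        return False  # not enough history to separate a trend from noise
    baseline = mean(scores[:-window])
    recent = mean(scores[-window:])
    return (baseline - recent) > drop
```

In practice you would tune `window` and `drop` to your output volume: a high-traffic agent can use a shorter window because each window still contains enough samples to average out noise.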
How Iris Helps#
Iris scores every output inline, making drift visible as a trend in the dashboard. When scores drop over time, you see it immediately — not weeks later from user complaints. The "All Time" period view lets you compare current performance against historical baselines.
Read the deep dive: Eval Drift →
Related Concepts#
Self-Calibrating Eval
The pattern that solves drift — eval rules that adapt thresholds based on observed distributions.
The Eval Loop
The continuous cycle that catches drift: score, diagnose, calibrate, re-score.
The Eval Tax
Drift compounds the eval tax — undetected quality degradation increases costs over time.
Agent Eval
The complete guide to evaluating AI agent outputs.
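As an illustration of the self-calibrating pattern mentioned under Related Concepts, the sketch below derives a pass/fail threshold from the observed score distribution instead of a constant chosen at development time. The function name and percentile choice are assumptions for illustration, not part of Iris.

```python
def calibrated_threshold(scores: list[float], percentile: float = 10.0) -> float:
    """Set the failure cutoff at a low percentile of recently observed scores,
    so the threshold tracks the real distribution as it drifts rather than
    staying pinned to a value tuned against development data."""
    ranked = sorted(scores)
    idx = int(len(ranked) * percentile / 100)
    return ranked[min(idx, len(ranked) - 1)]
```

Recomputing this on a recent window each day keeps the eval rule aligned with what the agent actually produces, which is the "calibrate" step of the eval loop.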