Self-Calibrating Eval
Eval rules that know when their own thresholds are wrong — and tell you how to fix them.
Definition
A self-calibrating eval is an eval rule that tracks the distribution of its own scores over time, detects when its pass threshold no longer separates good outputs from bad ones, and recommends an adjusted threshold for a human to approve.
Why Static Thresholds Break
A completeness threshold set at 0.7 might be perfect today — passing genuinely good outputs and catching bad ones. Three weeks later, the same threshold passes everything (because model quality improved) or fails everything (because input patterns shifted). The threshold didn't change. The world around it did. This is eval drift at the threshold level.
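A tiny sketch of this failure mode, using hypothetical completeness scores (the weekly numbers below are illustrative, not real data): the threshold stays fixed at 0.7 while the score distribution shifts underneath it, and its pass rate collapses to 100%.

```python
# Hypothetical completeness scores illustrating threshold-level drift.
# Week 1: scores straddle the 0.7 threshold, so it separates outputs.
week1 = [0.55, 0.62, 0.68, 0.72, 0.78, 0.81, 0.66, 0.90]
# Week 4: model quality improved; every score now clears 0.7.
week4 = [0.84, 0.88, 0.91, 0.83, 0.95, 0.87, 0.90, 0.86]

THRESHOLD = 0.7

def pass_rate(scores, threshold):
    """Fraction of outputs at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

print(pass_rate(week1, THRESHOLD))  # 0.5  -> the threshold is informative
print(pass_rate(week4, THRESHOLD))  # 1.0  -> it no longer discriminates
```

The threshold itself never changed; only the distribution of scores around it did, which is why a pass rate pinned at 100% (or 0%) is the signal to watch.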
The Pattern
Monitor
Track the distribution of scores for each eval rule over time.
Detect
Flag when pass rates hit extremes (100% or 0%) or distributions shift significantly.
Recommend
Suggest adjusted thresholds based on observed data. Human approves or rejects.
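The monitor → detect → recommend loop above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the extreme-rate cutoffs (2% and 98%), the 80% target pass rate, and the percentile-based suggestion are all assumptions chosen for the example.

```python
from statistics import quantiles

def detect_calibration_issue(scores, threshold, lo=0.02, hi=0.98):
    """Detect: flag a rule whose pass rate has hit an extreme."""
    rate = sum(s >= threshold for s in scores) / len(scores)
    if rate >= hi:
        return "passes-everything"
    if rate <= lo:
        return "fails-everything"
    return None  # distribution still straddles the threshold

def recommend_threshold(scores, target_pass_rate=0.8):
    """Recommend: a threshold under which ~target_pass_rate of recent
    outputs would pass. A human approves or rejects the suggestion."""
    cuts = quantiles(scores, n=100)  # percentile cut points
    idx = max(0, min(98, round((1 - target_pass_rate) * 100) - 1))
    return cuts[idx]

# Monitor: recent scores for one eval rule (hypothetical data).
recent = [0.84, 0.88, 0.91, 0.83, 0.95, 0.87, 0.90, 0.86]
issue = detect_calibration_issue(recent, threshold=0.7)
if issue:
    suggestion = recommend_threshold(recent)
    print(f"{issue}: suggest raising threshold to {suggestion:.2f}")
```

The key design point is the last step: the code only *suggests* a new threshold; applying it remains a human decision, which keeps the calibration loop auditable.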
How Iris Helps
Iris provides the scoring data that powers self-calibrating eval. Every output is scored with the same rules, building the distribution data needed to detect calibration issues. The dashboard shows pass/fail rates over time, so when a rule passes 100% of outputs, it's visible immediately.
Read the deep dive: Self-Calibrating Eval →
Related Concepts
Eval Drift
Self-calibrating eval is the solution to threshold-level drift.
The Eval Loop
Self-calibration is the 'Calibrate' stage of the eval loop, automated.
Output Quality Score
A composite metric that benefits from calibrated individual rules.
Agent Eval
The complete guide to evaluating AI agent outputs.