Self-Calibrating Eval
Eval rules that know when their own thresholds are wrong — and tell you how to fix them.
Definition
A self-calibrating eval is an eval rule that tracks the distribution of its own scores over time, detects when its pass threshold no longer separates good outputs from bad ones, and recommends an adjusted threshold for a human to approve.
Why Static Thresholds Break
A completeness threshold set at 0.7 might be perfect today — passing genuinely good outputs and catching bad ones. Three weeks later, the same threshold passes everything (because model quality improved) or fails everything (because input patterns shifted). The threshold didn't change. The world around it did. This is eval drift at the threshold level.
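A tiny sketch of this failure mode, using hypothetical completeness scores (the weekly numbers below are illustrative, not real data): the threshold stays fixed at 0.7 while the score distribution shifts underneath it, and its pass rate collapses to 100%.

```python
# Hypothetical completeness scores illustrating threshold-level drift.
# Week 1: scores straddle the 0.7 threshold, so it separates outputs.
week1 = [0.55, 0.62, 0.68, 0.72, 0.78, 0.81, 0.66, 0.90]
# Week 4: model quality improved; every score now clears 0.7.
week4 = [0.84, 0.88, 0.91, 0.83, 0.95, 0.87, 0.90, 0.86]

THRESHOLD = 0.7

def pass_rate(scores, threshold):
    """Fraction of outputs at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

print(pass_rate(week1, THRESHOLD))  # 0.5  -> the threshold is informative
print(pass_rate(week4, THRESHOLD))  # 1.0  -> it no longer discriminates
```

The threshold itself never changed; only the distribution of scores around it did, which is why a pass rate pinned at 100% (or 0%) is the signal to watch.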
The Pattern
Monitor
Track the distribution of scores for each eval rule over time.
Detect
Flag when pass rates hit extremes (100% or 0%) or distributions shift significantly.
Recommend
Suggest adjusted thresholds based on observed data. Human approves or rejects.
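The monitor → detect → recommend loop above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the extreme-rate cutoffs (2% and 98%), the 80% target pass rate, and the percentile-based suggestion are all assumptions chosen for the example.

```python
from statistics import quantiles

def detect_calibration_issue(scores, threshold, lo=0.02, hi=0.98):
    """Detect: flag a rule whose pass rate has hit an extreme."""
    rate = sum(s >= threshold for s in scores) / len(scores)
    if rate >= hi:
        return "passes-everything"
    if rate <= lo:
        return "fails-everything"
    return None  # distribution still straddles the threshold

def recommend_threshold(scores, target_pass_rate=0.8):
    """Recommend: a threshold under which ~target_pass_rate of recent
    outputs would pass. A human approves or rejects the suggestion."""
    cuts = quantiles(scores, n=100)  # percentile cut points
    idx = max(0, min(98, round((1 - target_pass_rate) * 100) - 1))
    return cuts[idx]

# Monitor: recent scores for one eval rule (hypothetical data).
recent = [0.84, 0.88, 0.91, 0.83, 0.95, 0.87, 0.90, 0.86]
issue = detect_calibration_issue(recent, threshold=0.7)
if issue:
    suggestion = recommend_threshold(recent)
    print(f"{issue}: suggest raising threshold to {suggestion:.2f}")
```

The key design point is the last step: the code only *suggests* a new threshold; applying it remains a human decision, which keeps the calibration loop auditable.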
How Iris Helps
Iris provides the scoring data that powers self-calibrating eval. Every output is scored with the same rules, building the distribution data needed to detect calibration issues. The dashboard shows pass/fail rates over time, so when a rule passes 100% of outputs, it's visible immediately.
Read the deep dive: Self-Calibrating Eval →
Related Concepts
Eval Drift
Self-calibrating eval is the solution to threshold-level drift.
The Eval Loop
Self-calibration is the 'Calibrate' stage of the eval loop, automated.
Output Quality Score
A composite metric that benefits from calibrated individual rules.
Agent Eval
The complete guide to evaluating AI agent outputs.