Iris v0.1 — The agent eval standard for MCP. 12 eval rules, open source

Self-Calibrating Eval

Eval rules that know when their own thresholds are wrong — and tell you how to fix them.

Definition

Self-calibrating eval is the pattern where evaluation rules monitor their own scoring distribution and recommend threshold adjustments. Instead of manually tuning thresholds based on intuition, the system observes real output distributions and suggests when thresholds should tighten or loosen. Adjustments are always human-approved.

Why Static Thresholds Break

A completeness threshold set at 0.7 might be perfect today — passing genuinely good outputs and catching bad ones. Three weeks later, the same threshold passes everything (because model quality improved) or fails everything (because input patterns shifted). The threshold didn't change. The world around it did. This is eval drift at the threshold level.
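A toy simulation makes the failure mode concrete: the threshold stays fixed while the score distribution moves underneath it. The distributions and the 0.7 threshold below are invented for illustration, not real Iris data:

```python
import random

random.seed(0)

THRESHOLD = 0.7  # fixed completeness threshold

def pass_rate(scores, threshold=THRESHOLD):
    """Fraction of outputs whose score clears the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Week 1: scores spread around the threshold -- it still discriminates.
week1 = [random.gauss(0.7, 0.1) for _ in range(500)]

# Week 4: model quality improved; scores now cluster well above 0.7.
week4 = [random.gauss(0.9, 0.05) for _ in range(500)]

print(f"week 1 pass rate: {pass_rate(week1):.0%}")  # roughly half pass
print(f"week 4 pass rate: {pass_rate(week4):.0%}")  # nearly everything passes
```

Nothing about the rule changed between week 1 and week 4; only the world did. A dashboard watching these pass rates would surface the drift immediately.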

Key Data

A 100% pass rate is not a sign of quality — it's a sign your thresholds need tightening. A 0% pass rate is not a sign your outputs are bad — it's a sign your thresholds need loosening. Both are calibration problems, not quality signals.

The Pattern

1. Monitor: Track the distribution of scores for each eval rule over time.

2. Detect: Flag when pass rates hit extremes (100% or 0%) or distributions shift significantly.

3. Recommend: Suggest adjusted thresholds based on observed data. A human approves or rejects each change.
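The monitor → detect → recommend loop can be sketched in a few lines of Python. The `detect` and `recommend` functions, the 0.99/0.01 extremes, and the target pass rate are illustrative assumptions, not part of any Iris specification:

```python
from statistics import quantiles

def detect(scores, threshold):
    """Flag calibration problems from the observed score distribution."""
    rate = sum(s >= threshold for s in scores) / len(scores)
    if rate >= 0.99:
        return "tighten"   # everything passes: threshold too loose
    if rate <= 0.01:
        return "loosen"    # everything fails: threshold too strict
    return None            # threshold still discriminates

def recommend(scores, target_pass_rate=0.8):
    """Suggest a threshold that would pass the top target_pass_rate of outputs."""
    # The (1 - target) percentile of observed scores is the candidate threshold.
    idx = round((1 - target_pass_rate) * 100) - 1
    return round(quantiles(scores, n=100)[idx], 2)

# A human reviews the recommendation before anything changes.
scores = [0.88, 0.91, 0.93, 0.95, 0.96, 0.97, 0.97, 0.98, 0.99, 0.99]
if detect(scores, threshold=0.7) == "tighten":
    print("suggest raising threshold to", recommend(scores))
```

The key design choice is in the last step: the system only ever proposes a new threshold, keeping a human in the loop for every adjustment.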

How Iris Helps

Iris provides the scoring data that powers self-calibrating eval. Every output is scored with the same rules, building the distribution data needed to detect calibration issues. The dashboard shows pass/fail rates over time — when a rule passes 100% of outputs, it's visible immediately.
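As a rough illustration of the kind of aggregation such a dashboard performs — the record fields and the `pass_rates_by_day` helper below are hypothetical, not Iris's actual data model or API:

```python
from collections import defaultdict

# Hypothetical scoring records: one per evaluated output,
# with an eval rule name, a day bucket, and a pass/fail result.
records = [
    {"rule": "completeness", "day": "2025-06-01", "passed": True},
    {"rule": "completeness", "day": "2025-06-01", "passed": False},
    {"rule": "completeness", "day": "2025-06-02", "passed": True},
    {"rule": "completeness", "day": "2025-06-02", "passed": True},
]

def pass_rates_by_day(records, rule):
    """Aggregate per-day pass rates for one eval rule."""
    buckets = defaultdict(lambda: [0, 0])  # day -> [passes, total]
    for r in records:
        if r["rule"] == rule:
            buckets[r["day"]][0] += r["passed"]
            buckets[r["day"]][1] += 1
    return {day: p / t for day, (p, t) in sorted(buckets.items())}

print(pass_rates_by_day(records, "completeness"))
# {'2025-06-01': 0.5, '2025-06-02': 1.0}
```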

Read the deep dive: Self-Calibrating Eval →

Frequently Asked Questions