Iris v0.1 — The agent eval standard for MCP. 12 eval rules, open source

The Eval Loop

Evals are the loss function for agent quality. The loop is how you optimize.

Definition

The eval loop is the continuous cycle of scoring agent outputs, diagnosing failures, calibrating thresholds, and re-scoring. Most teams treat eval as a one-time gate. The eval loop treats it as a continuous feedback signal — the loss function that drives agent quality upward over time.
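To make the "scoring" part of the definition concrete, here is a minimal sketch of what a rule-based scorer could look like. The `EvalRule` shape, the `score` helper, and the example `length_rule` are all illustrative assumptions, not Iris's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical shape of an eval rule: a named check that scores one output.
@dataclass
class EvalRule:
    name: str
    check: Callable[[str], float]  # returns a score in [0, 1]
    threshold: float               # score >= threshold counts as a pass

def score(output: str, rules: list[EvalRule]) -> dict[str, bool]:
    """Run every rule on one output; capture pass/fail per rule."""
    return {r.name: r.check(output) >= r.threshold for r in rules}

# Example rule: output must be non-empty and stay under 500 characters.
length_rule = EvalRule("length", lambda o: 1.0 if 0 < len(o) <= 500 else 0.0, 0.5)
print(score("hello", [length_rule]))  # {'length': True}
```

Treating every rule as a score plus a threshold is what makes the later calibration stage possible: thresholds can move without rewriting checks.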

Four Stages

1. Score: Run eval rules on every output. Capture pass/fail and scores.

2. Diagnose: Identify which rules fail most. Find patterns in failures.

3. Calibrate: Adjust thresholds based on real distributions. Tighten or loosen.

4. Re-score: Verify calibration improved results. Then repeat continuously.
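The four stages above can be sketched as one self-contained loop. This is a toy illustration under assumed inputs: rules are just name-to-threshold pairs over precomputed scores, and "calibrate" is simplified to nudging the worst rule's threshold toward the median observed score.

```python
from statistics import median

def eval_loop(scores: dict[str, list[float]], thresholds: dict[str, float],
              iterations: int = 3) -> dict[str, float]:
    for _ in range(iterations):
        # 1. Score: pass/fail each output's score against each rule's threshold.
        fails = {rule: sum(s < t for s in scores[rule])
                 for rule, t in thresholds.items()}
        # 2. Diagnose: find the rule that fails most often.
        worst = max(fails, key=fails.get)
        # 3. Calibrate: move its threshold toward the median observed score.
        thresholds[worst] = (thresholds[worst] + median(scores[worst])) / 2
        # 4. Re-score: the next pass re-runs scoring with the new thresholds.
    return thresholds

thresholds = eval_loop({"grounded": [0.9, 0.8, 0.2], "concise": [0.7, 0.6, 0.5]},
                       {"grounded": 0.95, "concise": 0.4})
```

In this toy run the overly strict "grounded" threshold is relaxed over three iterations while "concise", which never fails, is left alone — the loop converges toward thresholds that match the real score distribution.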

Loop, Not Gate

A gate checks once: pass or fail. A loop checks continuously and feeds back into improvement. The teams that build the eval loop first end up with a compounding advantage — they know what actually works, not what sounds like it should work. Each iteration gets tighter because you're working from data, not vibes.

Key Data

The eval loop is to agent quality what the training loop is to model quality. You wouldn't train a model without a loss function. Don't ship an agent without one either.

How Iris Helps

Iris provides the scoring layer for the eval loop. Every output is scored automatically, so the "Score" stage runs with zero effort. The dashboard provides the "Diagnose" stage: see which rules fail, when, and at what rate. Calibration and re-scoring happen as you adjust rules and watch the impact in real time.
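The "Diagnose" step described above amounts to aggregating per-rule failure rates across scored outputs. A minimal sketch, assuming results arrive as per-output dicts of rule name to pass/fail (the rule names are illustrative, not Iris's API):

```python
from collections import Counter

def failure_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of outputs on which each rule failed."""
    if not results:
        return {}
    fails = Counter(rule for r in results
                    for rule, passed in r.items() if not passed)
    total = len(results)
    return {rule: fails[rule] / total for rule in results[0]}

rates = failure_rates([
    {"grounded": True, "concise": False},
    {"grounded": False, "concise": False},
])
# 'concise' fails on 2 of 2 outputs; 'grounded' on 1 of 2.
```

Sorting the resulting rates descending surfaces the rules worth calibrating first.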

Read the deep dive: The Eval Loop →

Frequently Asked Questions