The Eval Loop
Evals are the loss function for agent quality. The loop is how you optimize.
Definition#
The eval loop is the continuous cycle of scoring agent outputs against eval rules, diagnosing failures, calibrating thresholds, and re-scoring to verify the change — repeated so quality improves with each pass.
Four Stages#
Score
Run eval rules on every output. Capture pass/fail and scores.
Diagnose
Identify which rules fail most. Find patterns in failures.
Calibrate
Adjust thresholds based on real distributions. Tighten or loosen.
Re-score
Verify calibration improved results. Then repeat continuously.
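The four stages can be sketched as a minimal loop. This is an illustrative sketch, not Iris's implementation; the `Rule` class, `eval_loop` function, and the fixed 10% threshold adjustment are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    score: Callable[[str], float]  # maps an output to a quality score in [0, 1]

    def check(self, output: str, threshold: float) -> bool:
        return self.score(output) >= threshold

def eval_loop(outputs, rules, thresholds, rounds=3):
    """Score -> Diagnose -> Calibrate -> Re-score, repeated."""
    fail_rates = {}
    for _ in range(rounds):
        # Score: run every rule on every output, capture pass/fail
        results = {
            r.name: [r.check(o, thresholds[r.name]) for o in outputs]
            for r in rules
        }
        # Diagnose: which rule fails most often?
        fail_rates = {
            name: 1 - sum(passed) / len(passed)
            for name, passed in results.items()
        }
        worst = max(fail_rates, key=fail_rates.get)
        # Calibrate: loosen the noisiest threshold toward the observed scores
        thresholds[worst] *= 0.9
        # Re-score: the next iteration verifies the adjustment
    return fail_rates
```

In practice the calibration step would look at the actual score distribution rather than applying a blanket 10% cut, but the shape of the loop — score, rank failures, adjust, verify — stays the same.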
Loop, Not Gate#
A gate checks once: pass or fail. A loop checks continuously and feeds back into improvement. The teams that build the eval loop first end up with a compounding advantage — they know what actually works, not what sounds like it should work. Each iteration gets tighter because you're working from data, not vibes.
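The gate/loop distinction can be made concrete. A hypothetical sketch — the function names and the fixed 0.01 adjustment step are assumptions for illustration:

```python
def gate(score: float, threshold: float) -> bool:
    # A gate checks once and stops: pass or fail, nothing feeds back
    return score >= threshold

def loop(scores: list[float], threshold: float) -> float:
    # A loop feeds each result back, so the threshold tracks the
    # real score distribution instead of staying frozen
    for score in scores:
        passed = score >= threshold
        # tighten slightly on a pass, loosen slightly on a fail
        threshold += 0.01 if passed else -0.01
    return threshold
```

The gate's threshold is whatever someone guessed on day one; the loop's threshold is whatever the data has earned.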
How Iris Helps#
Iris provides the scoring layer for the eval loop. Every output is scored automatically — the "Score" stage runs with zero effort. The dashboard provides the "Diagnose" stage — see which rules fail, when, and at what rate. Calibration and re-scoring happen as you adjust rules and watch the impact in real time.
Read the deep dive: The Eval Loop →
Related Concepts#
Eval Drift
The loop catches drift — quality degradation becomes visible as a downward trend.
Self-Calibrating Eval
The next evolution — automated threshold calibration within the loop.
Eval-Driven Development
EDD starts the loop. Once rules are defined, the loop keeps them honest.
Agent Eval
The complete guide to evaluating AI agent outputs.