The Eval Loop
Evals are the loss function for agent quality. The loop is how you optimize.
Definition#
The eval loop is the continuous cycle of scoring agent outputs against eval rules, diagnosing failures, calibrating thresholds, and re-scoring to verify the change — repeated so quality improves with each pass.
Four Stages#
Score
Run eval rules on every output. Capture pass/fail and scores.
Diagnose
Identify which rules fail most. Find patterns in failures.
Calibrate
Adjust thresholds based on real distributions. Tighten or loosen.
Re-score
Verify calibration improved results. Then repeat continuously.
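The four stages can be sketched as a minimal loop. This is an illustrative sketch, not Iris's implementation; the `Rule` class, `eval_loop` function, and the fixed 10% threshold adjustment are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    score: Callable[[str], float]  # maps an output to a quality score in [0, 1]

    def check(self, output: str, threshold: float) -> bool:
        return self.score(output) >= threshold

def eval_loop(outputs, rules, thresholds, rounds=3):
    """Score -> Diagnose -> Calibrate -> Re-score, repeated."""
    fail_rates = {}
    for _ in range(rounds):
        # Score: run every rule on every output, capture pass/fail
        results = {
            r.name: [r.check(o, thresholds[r.name]) for o in outputs]
            for r in rules
        }
        # Diagnose: which rule fails most often?
        fail_rates = {
            name: 1 - sum(passed) / len(passed)
            for name, passed in results.items()
        }
        worst = max(fail_rates, key=fail_rates.get)
        # Calibrate: loosen the noisiest threshold toward the observed scores
        thresholds[worst] *= 0.9
        # Re-score: the next iteration verifies the adjustment
    return fail_rates
```

In practice the calibration step would look at the actual score distribution rather than applying a blanket 10% cut, but the shape of the loop — score, rank failures, adjust, verify — stays the same.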
Loop, Not Gate#
A gate checks once: pass or fail. A loop checks continuously and feeds back into improvement. The teams that build the eval loop first end up with a compounding advantage — they know what actually works, not what sounds like it should work. Each iteration gets tighter because you're working from data, not vibes.
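The gate/loop distinction can be made concrete. A hypothetical sketch — the function names and the fixed 0.01 adjustment step are assumptions for illustration:

```python
def gate(score: float, threshold: float) -> bool:
    # A gate checks once and stops: pass or fail, nothing feeds back
    return score >= threshold

def loop(scores: list[float], threshold: float) -> float:
    # A loop feeds each result back, so the threshold tracks the
    # real score distribution instead of staying frozen
    for score in scores:
        passed = score >= threshold
        # tighten slightly on a pass, loosen slightly on a fail
        threshold += 0.01 if passed else -0.01
    return threshold
```

The gate's threshold is whatever someone guessed on day one; the loop's threshold is whatever the data has earned.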
How Iris Helps#
Iris provides the scoring layer for the eval loop. Every output is scored automatically — the "Score" stage runs with zero effort. The dashboard provides the "Diagnose" stage — see which rules fail, when, and at what rate. Calibration and re-scoring happen as you adjust rules and watch the impact in real time.
Read the deep dive: The Eval Loop →
Related Concepts#
Eval Drift
The loop catches drift — quality degradation becomes visible as a downward trend.
Self-Calibrating Eval
The next evolution — automated threshold calibration within the loop.
Eval-Driven Development
EDD starts the loop. Once rules are defined, the loop keeps them honest.
Agent Eval
The complete guide to evaluating AI agent outputs.