Eval Coverage
The metric that tells you how much of your agent's output is actually being evaluated.
Definition
Eval coverage is the percentage of an agent's outputs that are actually scored by evaluation rules. At 100% coverage, every output is evaluated; anything less leaves unscored outputs that can fail silently.
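In code terms, the metric is just a ratio. A minimal sketch, with illustrative counter names (these are not from any specific library):

```python
def eval_coverage(evaluated_outputs: int, total_outputs: int) -> float:
    """Fraction of agent outputs that were scored by eval rules."""
    if total_outputs == 0:
        return 0.0
    return evaluated_outputs / total_outputs

print(eval_coverage(950, 1000))  # 0.95 -- the 50 unscored outputs are blind spots
```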
Why 100% Is the Only Target
In traditional software testing, 80% code coverage is considered good. In agent eval, anything less than 100% creates blind spots. The reason: agents fail intermittently and non-deterministically. The failure you care about most — a hallucinated answer, PII in an output, a cost spike — might happen on the one execution you didn't evaluate.
The Sampling Trap
Some teams evaluate a sample of outputs — 1%, 5%, 10%. This feels reasonable but misses the point. Agent failures cluster in the long tail: unusual inputs, edge cases, specific user contexts. A random 5% sample is overwhelmingly likely to miss these. By definition, the failures that matter most are the ones that happen rarely — and sampling misses rare events.
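The arithmetic behind this is simple. If a rare failure appears k times in a batch and you score a random fraction of outputs, the chance of seeing none of those failures is (1 − sample_rate)^k. A quick sketch with illustrative numbers:

```python
def p_miss_all(sample_rate: float, failure_count: int) -> float:
    """Probability a random sample catches zero instances of a rare failure,
    assuming failures occur independently of which outputs are sampled."""
    return (1 - sample_rate) ** failure_count

# 10 rare failures in a batch, scored with a 5% random sample:
print(p_miss_all(0.05, 10))  # ~0.599 -- better-than-even odds you see none of them
```

Even with ten occurrences of the failure, a 5% sample misses all of them about 60% of the time.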
How Iris Helps
Iris achieves 100% eval coverage by design. Every output that flows through the MCP protocol gets scored — no sampling, no pipeline, no opt-in. The eval rules run in under one millisecond, so there's no performance reason to sample. Full coverage, zero overhead.
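The architectural idea — score every output on the path where it flows, rather than sampling afterward — can be sketched generically. This is a hypothetical wrapper, not Iris's actual API; `handler` and `score` are illustrative stand-ins:

```python
from typing import Callable

def with_full_coverage(handler: Callable[[str], str],
                       score: Callable[[str], bool]) -> Callable[[str], str]:
    """Wrap an output-producing handler so every output is scored inline.

    Because the evaluator sits on the only path an output can take,
    coverage is structural: nothing can opt out or be skipped by sampling.
    """
    def wrapped(prompt: str) -> str:
        output = handler(prompt)
        if not score(output):  # every output is scored, no sampling
            raise ValueError("output failed eval rules")
        return output
    return wrapped

# Usage: a toy handler plus a toy PII rule.
safe = with_full_coverage(lambda p: p.upper(), score=lambda o: "SSN" not in o)
print(safe("hello"))  # HELLO
```

The design point is that coverage comes from placement, not diligence: when evaluation is on the data path itself, 100% coverage holds by construction.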
Read the deep dive: Eval Coverage →
Related Concepts
The Eval Tax
0% coverage = maximum eval tax. Every unscored output adds to the balance.
Eval-Driven Development
Define rules before prompts — coverage is built in from the start.
The Eval Loop
Coverage is the foundation — you can't loop on what you don't measure.
Agent Eval
The complete guide to evaluating AI agent outputs.