Eval Coverage
The metric that tells you how much of your agent's output is actually being evaluated.
Definition
Eval coverage is the percentage of an agent's outputs that are actually scored by evaluation rules. At 100% coverage, every output is evaluated; anything less leaves unscored outputs that can fail silently.
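In code terms, the metric is just a ratio. A minimal sketch, with illustrative counter names (these are not from any specific library):

```python
def eval_coverage(evaluated_outputs: int, total_outputs: int) -> float:
    """Fraction of agent outputs that were scored by eval rules."""
    if total_outputs == 0:
        return 0.0
    return evaluated_outputs / total_outputs

print(eval_coverage(950, 1000))  # 0.95 -- the 50 unscored outputs are blind spots
```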
Why 100% Is the Only Target
In traditional software testing, 80% code coverage is considered good. In agent eval, anything less than 100% creates blind spots. The reason: agents fail intermittently and non-deterministically. The failure you care about most — a hallucinated answer, PII in an output, a cost spike — might happen on the one execution you didn't evaluate.
The Sampling Trap
Some teams evaluate a sample of outputs — 1%, 5%, 10%. This feels reasonable but misses the point. Agent failures cluster in the long tail: unusual inputs, edge cases, specific user contexts. A random 5% sample is overwhelmingly likely to miss these. By definition, the failures that matter most are the ones that happen rarely — and sampling misses rare events.
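The arithmetic behind this is simple. If a rare failure appears k times in a batch and you score a random fraction of outputs, the chance of seeing none of those failures is (1 − sample_rate)^k. A quick sketch with illustrative numbers:

```python
def p_miss_all(sample_rate: float, failure_count: int) -> float:
    """Probability a random sample catches zero instances of a rare failure,
    assuming failures occur independently of which outputs are sampled."""
    return (1 - sample_rate) ** failure_count

# 10 rare failures in a batch, scored with a 5% random sample:
print(p_miss_all(0.05, 10))  # ~0.599 -- better-than-even odds you see none of them
```

Even with ten occurrences of the failure, a 5% sample misses all of them about 60% of the time.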
How Iris Helps
Iris achieves 100% eval coverage by design. Every output that flows through the MCP protocol gets scored — no sampling, no pipeline, no opt-in. The eval rules run in under one millisecond, so there's no performance reason to sample. Full coverage, zero overhead.
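The architectural idea — score every output on the path where it flows, rather than sampling afterward — can be sketched generically. This is a hypothetical wrapper, not Iris's actual API; `handler` and `score` are illustrative stand-ins:

```python
from typing import Callable

def with_full_coverage(handler: Callable[[str], str],
                       score: Callable[[str], bool]) -> Callable[[str], str]:
    """Wrap an output-producing handler so every output is scored inline.

    Because the evaluator sits on the only path an output can take,
    coverage is structural: nothing can opt out or be skipped by sampling.
    """
    def wrapped(prompt: str) -> str:
        output = handler(prompt)
        if not score(output):  # every output is scored, no sampling
            raise ValueError("output failed eval rules")
        return output
    return wrapped

# Usage: a toy handler plus a toy PII rule.
safe = with_full_coverage(lambda p: p.upper(), score=lambda o: "SSN" not in o)
print(safe("hello"))  # HELLO
```

The design point is that coverage comes from placement, not diligence: when evaluation is on the data path itself, 100% coverage holds by construction.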
Read the deep dive: Eval Coverage →
Related Concepts
The Eval Tax
0% coverage = maximum eval tax. Every unscored output adds to the balance.
Eval-Driven Development
Define rules before prompts — coverage is built in from the start.
The Eval Loop
Coverage is the foundation — you can't loop on what you don't measure.
Agent Eval
The complete guide to evaluating AI agent outputs.