
Eval Coverage

The metric that tells you how much of your agent's output is actually being evaluated.

Definition

Eval coverage is the percentage of agent executions that receive automated evaluation. An agent handling 1,000 requests with 50 evaluated has 5% coverage. Most teams are at 0%. The target is 100% — every output scored, every time.
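The definition above reduces to a one-line ratio. A minimal sketch (the function name is illustrative, not part of Iris):

```python
def eval_coverage(evaluated: int, total: int) -> float:
    """Fraction of agent executions that received automated evaluation."""
    if total == 0:
        return 0.0
    return evaluated / total

# The example from the definition: 50 evaluated out of 1,000 requests.
print(f"{eval_coverage(50, 1000):.0%}")  # → 5%
```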

Why 100% Is the Only Target

In traditional software testing, 80% code coverage is considered good. In agent eval, anything less than 100% creates blind spots. The reason: agents fail intermittently and non-deterministically. The failure you care about most — a hallucinated answer, PII in an output, a cost spike — might happen on the one execution you didn't evaluate.

Key Distinction

Test coverage asks "did we test this code path?" Eval coverage asks "did we score this output?" The first is about code. The second is about every individual execution.

The Sampling Trap

Some teams evaluate a random sample of outputs — 1%, 5%, 10%. This feels reasonable but misses the point. Agent failures cluster in the long tail: unusual inputs, edge cases, specific user contexts. A random 5% sample is overwhelmingly likely to miss them. The failures that matter most are precisely the ones that happen rarely — and random sampling is worst at catching rare events.
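The miss rate is easy to quantify. Treating each execution as independently sampled, the probability that a sample catches none of the failing executions is (1 − sample rate) raised to the number of failures. The failure count below is a hypothetical illustration, not data from the article:

```python
def miss_probability(sample_rate: float, failing_executions: int) -> float:
    """Probability that a random sample contains none of the failing
    executions, assuming each execution is sampled independently."""
    return (1 - sample_rate) ** failing_executions

# Hypothetical numbers: 10 failing runs in a batch, sampled at 5%.
p = miss_probability(0.05, 10)
print(f"{p:.0%}")  # → 60% — the sample most likely contains no failure at all
```

Even with ten distinct failures present, a 5% sample misses every one of them about 60% of the time.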

How Iris Helps

Iris achieves 100% eval coverage by design. Every output that flows through the MCP protocol gets scored — no sampling, no pipeline, no opt-in. The eval rules run in under one millisecond, so there's no performance reason to sample. Full coverage, zero overhead.
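The "score every output inline" pattern can be sketched as follows. This is a hypothetical illustration of the shape of the approach — the rule names and the `score_output` function are invented for this example and are not Iris's actual API:

```python
from typing import Callable

# A rule maps one output string to pass/fail.
Rule = Callable[[str], bool]

def no_empty_answer(output: str) -> bool:
    """Fail on blank or whitespace-only responses."""
    return bool(output.strip())

RULES: list[Rule] = [no_empty_answer]

def score_output(output: str) -> dict[str, bool]:
    """Run every rule on every output — no sampling, no opt-in."""
    return {rule.__name__: rule(output) for rule in RULES}

print(score_output("The capital of France is Paris."))
```

The key design point is that scoring sits on the execution path itself rather than in a separate batch pipeline, so coverage is 100% by construction.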

Read the deep dive: Eval Coverage →

Frequently Asked Questions