
Eval Coverage: The Metric Your AI Agents Are Missing

Ian Parent
eval-coverage · agent-eval · testing · mcp · vocabulary

Every serious codebase measures test coverage. CI pipelines enforce minimums. Pull requests get rejected when coverage drops. The industry spent two decades making this a standard practice.

For AI agents, the equivalent metric doesn't exist yet. It should. It's called eval coverage — the percentage of agent executions that receive an evaluation.

The Current State: Nearly Zero

The numbers are stark. Here is the takeaway from LangChain's State of Agent Engineering survey (1,340 respondents, late 2025):

The majority of companies building AI agents in production are running at effectively 0% eval coverage on live traffic. They are paying the eval tax on every unscored execution. They're shipping code without tests — except the code is non-deterministic, the failures are silent, and the consequences are user-facing.

Why Agent Eval Coverage Is Different from Test Coverage

In traditional software, test coverage measures what percentage of code paths your test suite exercises. Tools like Istanbul and Coverage.py make this measurable. The industry settled on 80-85% as the pragmatic target — high enough to catch most regressions, not so exhaustive that tests cost more than the code they protect.

For AI agents, coverage is structurally different. It's not about code paths — it's about executions. An agent can have 100% code test coverage — every function tested — and still produce garbage outputs in production, because the behavior lives in the model's probability distribution, not in deterministic code.

This means coverage must be measured at the output level: what percentage of actual agent outputs were evaluated for quality, safety, and cost?
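
To make that concrete, here is a minimal sketch in plain Python. The ExecutionRecord shape is a hypothetical stand-in for whatever your trace store holds; the point is that the denominator is executions, not code paths:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecutionRecord:
    """One agent execution. eval_result is None when the output was never scored."""
    run_id: str
    output: str
    eval_result: Optional[dict] = None  # e.g. {"quality": 0.9, "safety": "pass", "cost_usd": 0.002}

def eval_coverage(records: list[ExecutionRecord]) -> float:
    """Eval coverage: the percentage of executions whose output received an evaluation."""
    if not records:
        return 0.0
    evaluated = sum(1 for r in records if r.eval_result is not None)
    return 100.0 * evaluated / len(records)
```

An agent with 100% unit test coverage can still score near zero here: the metric only moves when each production output is actually evaluated.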

Why 100% Eval Coverage Matters

In software, 80% test coverage is considered good. An uncovered branch might be dead code that never runs. But with agent outputs, there is no dead code. Every call is a real user interaction with real consequences.

Spot-checking 25% of runs is not "mostly covered." It means 75% of your production failures are invisible. The failure that leaks PII, the hallucination that sends a customer wrong data, the $40 API call that should have been $0.12 — these live in the long tail, and they're the ones that generate lawsuits, churn, and trust destruction.

The Coverage Spectrum

| Level | What It Means | What You Miss |
| --- | --- | --- |
| 0% | No eval, ever | Everything. Flying blind. |
| 25% | Spot checks, manual review | 75% of failures invisible |
| 50% | Sampling — eval 1-in-2 calls | Half your production failures |
| 80% | What software considers "good" | 20% blind spots — still risky for agents |
| 100% | Every execution evaluated inline | Full visibility. Drift detectable from day one. |

The Test Coverage History Parallel

The journey from "tests are optional" to "shipping without tests is unprofessional" took about 15 years, and the payoff is well documented: a joint IBM and Microsoft study found that TDD reduced defect density by 40-90%.

Where are we with agent eval? Somewhere around 1999. The practice exists. A few leading teams use it. The tooling is emerging. The industry standard hasn't formed yet.

History is about to rhyme. The discipline that accelerates adoption is Eval-Driven Development — writing eval rules before prompts, the same way TDD writes tests before code.

How to Get to 100%

The reason most teams run at 0% eval coverage is that adding per-call evaluation is manual, fragile, and easy to forget, which is the same reason test coverage was low before CI made it structural. As we show in How to Evaluate Agent Output Without Calling Another LLM, heuristic rules make per-call evaluation fast and free enough to run on every execution.

The path to 100% follows the same pattern:

  1. Make it structural, not discretionary. If evaluation requires developers to add per-call instrumentation, coverage will always be incomplete. If evaluation is built into the protocol layer — the communication channel every agent already uses — coverage is automatic.

  2. Measure it. You can't improve what you don't measure. Track your eval coverage as a metric: (evaluated executions / total executions) × 100. A sketch of tracking and alerting on this number follows the list.

  3. Alert on drops. When eval coverage drops below 100%, something is misconfigured. Treat it like test coverage: a metric that goes in one direction.
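
Here is that sketch, written as a CI-style coverage gate. The counter values are hypothetical placeholders for whatever your trace store exposes; in production you would page or open an incident rather than exit:

```python
import sys

def check_eval_coverage(total_executions: int, evaluated_executions: int,
                        floor: float = 100.0) -> None:
    """Compute eval coverage and fail loudly when it drops below the floor."""
    if total_executions == 0:
        return  # nothing ran in this window
    coverage = 100.0 * evaluated_executions / total_executions
    print(f"eval coverage: {coverage:.1f}% ({evaluated_executions}/{total_executions})")
    if coverage < floor:
        # Treat a drop like a failing test-coverage gate: block the pipeline and alert.
        sys.exit(f"eval coverage {coverage:.1f}% is below the {floor:.0f}% floor")

# Illustrative numbers for the last 24 hours of production traffic.
check_eval_coverage(total_executions=1_420, evaluated_executions=1_420)
```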

The Iris Approach

Iris enables high eval coverage by integrating at the MCP protocol layer. Agents call Iris eval tools inline — the same way they call any other MCP tool — keeping evaluation within the agent's own workflow rather than requiring a separate instrumentation pass.
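
As a rough illustration of what "inline, like any other MCP tool" means at the protocol level, here is a sketch using the official MCP Python SDK. The server launch command, tool name (evaluate_output), and arguments are hypothetical placeholders rather than Iris's actual interface, and in practice the agent itself issues this call as part of its normal tool use:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical launch command for an Iris MCP server; check the Iris docs for the real one.
server = StdioServerParameters(command="npx", args=["-y", "iris-eval-mcp"])

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # The agent would make this call itself on each output it produces.
            result = await session.call_tool(
                "evaluate_output",  # hypothetical tool name
                arguments={"output": "Refund issued for order #4521."},
            )
            print(result.content)

asyncio.run(main())
```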

The architectural advantage: when eval is an MCP tool the agent can invoke on any output, adding coverage doesn't require per-call instrumentation in your application code. You configure Iris once, and the agent has access to eval on every execution.

This is why the coverage framing matters: protocol-native eval makes high coverage a matter of agent configuration, not developer discipline. The same way CI pipelines made test coverage structural, MCP-native eval makes agent eval coverage structural.

For the complete picture, see our Agent Eval: The Definitive Guide.


Iris is the agent eval standard for MCP. Add it to your MCP config and start scoring agent outputs inline. Try it: iris-eval.com/playground

