What Is Agent Eval?#
When an AI agent processes a request, traditional monitoring tells you the request completed. It tells you the latency, the status code, the resource consumption. What it does not tell you is whether the agent's response was accurate, safe, and worth what it cost.
Agent eval fills this gap. It runs inline on every execution in production — not as a one-time pre-deployment gate, not as an offline batch job, but as a continuous quality layer that scores every output the moment it is produced.
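Conceptually, inline eval wraps every agent call so the output is scored the moment it is produced, alongside the usual latency measurement. A minimal Python sketch, with a toy score_output evaluator standing in for a full rule set (both function names are illustrative, not a real API):

```python
import time

def score_output(output: str) -> dict:
    """Toy evaluator: flags empty or suspiciously short responses.
    A real system would run the full rule set described later."""
    passed = len(output.strip()) >= 20
    return {"passed": passed, "score": 1.0 if passed else 0.0}

def run_with_eval(agent, request: str) -> dict:
    """Run the agent, then score the output inline on the same execution."""
    start = time.monotonic()
    output = agent(request)
    latency = time.monotonic() - start
    result = score_output(output)  # the quality layer, not just the status code
    return {"output": output, "latency_s": latency, **result}
```

The key property is that scoring happens on the same execution path as the request, so every output gets a verdict rather than a sampled few.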
This is different from adjacent practices that are often confused with agent eval:
- Model benchmarking tests capabilities before deployment using standardized datasets. Agent eval scores real outputs in production.
- Prompt engineering optimizes instructions at development time. Agent eval verifies those instructions produce correct results at runtime.
- Infrastructure monitoring (APM) tracks whether systems are running. Agent eval tracks whether outputs are good.
The distinction matters because agents fail in ways traditional systems do not. They produce outputs that look correct but contain fabricated facts. They leak personally identifiable information. They silently degrade when upstream models change. These failure modes require evaluation at the output layer, not the infrastructure layer.
Why Agent Eval Matters#
The Cost of Not Evaluating#
Every unscored agent output carries a cost — in trust, in engineering hours, and in liability. This is the eval tax: the compounding price teams pay for operating agents without systematic evaluation. The longer an agent runs unevaluated, the higher the accumulated risk.
The Eval Tax
The compounding cost of every unscored agent output — in trust, engineering hours, and liability.
Silent Degradation#
Agent quality degrades without any code changes. Upstream model providers update their models without notice — a study by Stanford and Berkeley found that the share of GPT-4's generated code that was directly executable dropped from 52% to 10% between March and June 2023, with no changelog entry. One study found that 91% of machine learning models experience performance degradation over time.
This phenomenon is eval drift: the silent erosion of output quality driven by provider updates, shifting input distributions, and environmental changes. Without continuous evaluation, drift goes undetected until users report failures.
Eval Drift
Silent degradation of agent output quality over time — even when your code and prompts haven't changed.
The Production Gap#
The distance between “the demo works” and “production works” is the eval gap. Demos operate on curated inputs with forgiving audiences. Production operates on adversarial inputs at scale with real consequences. Agent eval is the bridge between the two.
The Eval Gap
The distance between “demo works” and “production works” — driven by input distribution, compound failure, and cost reality.
Where Agent Eval Fits: The Three-Layer Model#
Agent systems require three distinct layers of observability. Most teams have infrastructure monitoring. Emerging teams add protocol-level tracing. The layer that almost everyone is missing is output quality evaluation.
- Layer 3 — Output Quality: Was the output correct? Quality scoring, safety checks, cost thresholds.
- Layer 2 — Protocol: What did the agent do? MCP call completion, tool invocations, message routing.
- Layer 1 — Infrastructure: Did the request succeed? Uptime, latency, error rates, resource utilization.
Each layer answers a different question. Infrastructure monitoring tells you the server is up. Protocol monitoring tells you the MCP calls completed. Agent eval tells you whether the response was actually correct, safe, and worth what it cost. These layers are complementary — you need all three.
Agent Eval Methodologies#
There are two fundamental approaches to evaluating agent outputs, each with distinct tradeoffs. Most production systems benefit from combining both.
Heuristic Eval#
Heuristic evaluation uses deterministic, pattern-based rules to score agent outputs. These rules are fast (sub-millisecond), free (no API calls), and perfectly consistent — the same input always produces the same score.
Common heuristic eval rules include:
- PII detection — regex patterns for SSN, credit card, phone, email
- Prompt injection detection — patterns that indicate the agent is being manipulated
- Output length thresholds — responses that are suspiciously short or long
- Cost limits — token usage and execution cost exceeding budgets
- Blocklist enforcement — prohibited words, competitor names, harmful content
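As a sketch, several of these rules can be expressed in a handful of lines of Python. The patterns, thresholds, and blocklist below are simplified illustrations, not production-grade detectors:

```python
import re

# Illustrative rule set; real PII patterns need far more hardening.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}
BLOCKLIST = {"acme corp"}  # prohibited terms, competitor names

def heuristic_eval(output: str, cost_usd: float, budget_usd: float = 0.05) -> list[str]:
    """Return the list of rule violations; an empty list means pass."""
    violations = []
    for name, pattern in PII_PATTERNS.items():
        if pattern.search(output):
            violations.append(f"pii:{name}")
    if not 10 <= len(output) <= 4000:  # suspiciously short or long
        violations.append("length")
    if any(term in output.lower() for term in BLOCKLIST):
        violations.append("blocklist")
    if cost_usd > budget_usd:
        violations.append("cost")
    return violations
```

Because every check is deterministic, the same output always produces the same violation list, which is what makes heuristic rules suitable as hard gates.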
Semantic Eval (LLM-as-Judge)#
Semantic evaluation uses a language model to assess the meaning and quality of agent outputs. An LLM scores responses against a rubric — checking for factual accuracy, coherence, helpfulness, and relevance.
The tradeoffs are real: semantic eval is slower (seconds, not milliseconds), costs money (each evaluation is an API call), and introduces non-determinism (the judge model can produce different scores for the same input). It also creates a recursive trust problem: who evaluates the evaluator?
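A minimal sketch of the judge pattern, with the model call itself left out: build a rubric prompt, then parse the judge's JSON reply into a verdict. The rubric wording and the 3.5 pass threshold are illustrative assumptions:

```python
import json

RUBRIC = """Score the RESPONSE to the REQUEST from 1 (poor) to 5 (excellent)
on each criterion: accuracy, coherence, helpfulness, relevance.
Reply with JSON only, e.g. {"accuracy": 4, "coherence": 5, "helpfulness": 4, "relevance": 5}."""

def build_judge_prompt(request: str, response: str) -> str:
    """Assemble the prompt sent to the judge model (call not shown)."""
    return f"{RUBRIC}\n\nREQUEST:\n{request}\n\nRESPONSE:\n{response}"

def parse_judge_reply(reply: str, threshold: float = 3.5) -> dict:
    """Parse the judge model's JSON reply into a pass/fail verdict."""
    scores = json.loads(reply)
    mean = sum(scores.values()) / len(scores)
    return {"scores": scores, "mean": mean, "passed": mean >= threshold}
```

Constraining the judge to structured JSON output is one common way to reduce (though not eliminate) the non-determinism of the scores.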
Hybrid Approach#
The most effective production systems use both. Heuristic rules handle safety gates (PII, injection, cost) — these are non-negotiable, binary checks that must run on every execution at zero latency. Semantic eval handles quality assessment for a subset of outputs where subjective judgment matters. This layered approach gives you full coverage without the cost of running LLM-as-Judge on every request.
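The layering can be sketched as a short pipeline. Both checks below are stubs (a length-only gate and an always-pass judge) standing in for real implementations; the shape of the control flow is the point:

```python
import random

def heuristic_gate(output: str) -> bool:
    """Cheap, deterministic safety gate (stub: length check only)."""
    return 10 <= len(output) <= 4000

def semantic_judge(output: str) -> bool:
    """Expensive LLM-as-Judge call (stubbed here as always-pass)."""
    return True

def hybrid_eval(output: str, sample_rate: float = 0.05) -> dict:
    if not heuristic_gate(output):
        return {"passed": False, "reason": "heuristic"}  # hard fail, zero cost
    if random.random() < sample_rate:                    # sampled subset
        return {"passed": semantic_judge(output), "reason": "semantic"}
    return {"passed": True, "reason": "gates-only"}
```

Every execution pays the near-zero heuristic cost; only the sampled fraction pays the LLM-as-Judge cost, which is what keeps the approach affordable at 100% coverage of the gates.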
For a deep comparison of these approaches, see How to Evaluate Agent Output Without Calling Another LLM.
What to Evaluate: The Four Categories#
Agent eval rules typically fall into four categories. A comprehensive evaluation system covers all four, targeting 100% eval coverage — scoring every execution, not sampling.
- Completeness: Did the response address the full request? Topic consistency, expected elements, response coverage.
- Relevance: Is the response appropriate? Language match, format validation, sentiment alignment.
- Safety: Is the response safe to deliver? PII detection, hallucination markers, injection patterns, blocklist.
- Cost: Was the response worth the resources? Token usage, execution cost, latency thresholds.
Eval Coverage
The percentage of agent executions that receive evaluation. The target is 100% — every output scored, not sampled.
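One way to organize such a rule set is a registry keyed by the four categories, plus a coverage metric. The rule names below are hypothetical illustrations, not any particular product's schema:

```python
# Hypothetical rule registry spanning all four categories.
EVAL_RULES = {
    "completeness": ["topic_consistency", "expected_elements", "response_coverage"],
    "relevance":    ["language_match", "format_validation", "sentiment_alignment"],
    "safety":       ["pii_detection", "hallucination_markers", "injection_patterns", "blocklist"],
    "cost":         ["token_usage", "execution_cost", "latency_threshold"],
}

def coverage(executions_scored: int, executions_total: int) -> float:
    """Eval coverage: the share of executions that received a score."""
    return executions_scored / executions_total if executions_total else 0.0
```

Tracking coverage as a first-class metric makes the gap between "we evaluate" and "we evaluate everything" visible.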
Eval-Driven Development#
Test-driven development (TDD) changed how software teams write code: define the test first, then write the implementation. Research from IBM and Microsoft found that TDD reduces production defects by 40-90%.
Eval-driven development (EDD) applies the same discipline to AI agents. Define what “correct” looks like before writing the prompt. The workflow:
- Define rules — specify eval criteria for quality, safety, and cost
- Write prompt — build the agent with rules as the quality contract
- Run eval — score real outputs against the rules
- Iterate — adjust prompts, tools, or architecture based on scores
- Lock rules — once passing, rules become the regression suite
Eval-Driven Development (EDD)
Define evaluation rules before writing agent prompts — the same way TDD defines tests before writing code.
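The workflow can be sketched as a loop that scores an agent against pre-defined rules and locks them once a target pass rate is met. The 95% target and the rule/case shapes are illustrative assumptions:

```python
# Hypothetical EDD cycle: rules are written before the prompt and act as
# the quality contract the agent must satisfy.
def run_edd_cycle(agent, cases, rules, pass_rate_target=0.95):
    """Score the agent against pre-defined rules; the rule set becomes
    the regression suite once the target pass rate is reached."""
    passed = sum(
        all(rule(agent(case)) for rule in rules)  # every rule must hold
        for case in cases
    )
    pass_rate = passed / len(cases)
    return {"pass_rate": pass_rate, "locked": pass_rate >= pass_rate_target}
```

As in TDD, the locked rules then run against every future prompt or model change, catching regressions before users do.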
The Eval Loop: Continuous Quality#
Agent evaluation is not a one-time event. Models change, inputs drift, and thresholds that made sense last month may not make sense today. The eval loop is the continuous cycle that keeps evaluation calibrated: score outputs, watch the score distributions, recalibrate rules and thresholds, and repeat.
Advanced systems implement self-calibrating evaluation: the system monitors its own scoring distributions and recommends threshold adjustments when the environment shifts. This closes the loop automatically, reducing the manual burden of maintaining eval rules over time.
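One simple form of self-calibration is a statistical check on the scoring distributions: flag drift when recent scores fall well below the baseline. A sketch using a z-score heuristic, where the 2-sigma threshold is an assumption and production systems would use more robust statistics:

```python
from statistics import mean, stdev

def detect_score_drift(baseline: list[float], recent: list[float],
                       z_threshold: float = 2.0) -> dict:
    """Flag drift when the recent mean score falls more than z_threshold
    baseline standard deviations below the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    shift = (mean(recent) - mu) / sigma if sigma else 0.0
    return {"z_shift": round(shift, 2), "drifted": shift < -z_threshold}
```

A drift flag like this can trigger a threshold review or an alert, closing the loop without waiting for user complaints.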
Getting Started#
Iris is the agent eval standard for MCP. Any MCP-compatible agent can discover and use it automatically — no SDK, no code changes. Add it to your MCP configuration:
```json
{
  "mcpServers": {
    "iris": {
      "command": "npx",
      "args": ["-y", "@iris-eval/mcp-server"]
    }
  }
}
```

Iris provides three MCP tools: log_trace for recording agent executions, evaluate_output for scoring outputs against 12 built-in eval rules, and get_traces for querying historical data.
References#
- Stanford AI Index Report 2025 — AI safety incident trends, global adoption metrics
- Chen, L. et al. (2023) “How Is ChatGPT's Behavior Changing over Time?” — Stanford/Berkeley GPT-4 drift study (arXiv:2307.09009)
- IBM/Microsoft joint study — Test-driven development reduces production defects 40-90%
- Gartner (2025) — 40% of agentic AI projects projected to be canceled by 2027
- LangChain State of AI Agents (2025) — Production evaluation adoption rates
- EU AI Act — Article 14 human oversight requirements, effective August 2026
Last updated March 2026. This guide is maintained by the Iris team and updated as the agent eval landscape evolves.