What Is Agent Eval?#
When an AI agent processes a request, traditional monitoring tells you the request completed. It tells you the latency, the status code, the resource consumption. What it does not tell you is whether the agent's response was accurate, safe, and worth what it cost.
Agent eval fills this gap. It runs inline on every execution in production — not as a one-time pre-deployment gate, not as an offline batch job, but as a continuous quality layer that scores every output the moment it is produced.
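Conceptually, inline eval wraps every agent call so the output is scored the moment it is produced, alongside the usual latency measurement. A minimal Python sketch, with a toy score_output evaluator standing in for a full rule set (both function names are illustrative, not a real API):

```python
import time

def score_output(output: str) -> dict:
    """Toy evaluator: flags empty or suspiciously short responses.
    A real system would run the full rule set described later."""
    passed = len(output.strip()) >= 20
    return {"passed": passed, "score": 1.0 if passed else 0.0}

def run_with_eval(agent, request: str) -> dict:
    """Run the agent, then score the output inline on the same execution."""
    start = time.monotonic()
    output = agent(request)
    latency = time.monotonic() - start
    result = score_output(output)  # the quality layer, not just the status code
    return {"output": output, "latency_s": latency, **result}
```

The key property is that scoring happens on the same execution path as the request, so every output gets a verdict rather than a sampled few.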
This is different from adjacent practices that are often confused with agent eval:
- Model benchmarking tests capabilities before deployment using standardized datasets. Agent eval scores real outputs in production.
- Prompt engineering optimizes instructions at development time. Agent eval verifies those instructions produce correct results at runtime.
- Infrastructure monitoring (APM) tracks whether systems are running. Agent eval tracks whether outputs are good.
The distinction matters because agents fail in ways traditional systems do not. They produce outputs that look correct but contain fabricated facts. They leak personally identifiable information. They silently degrade when upstream models change. These failure modes require evaluation at the output layer, not the infrastructure layer.
Why Agent Eval Matters#
The Cost of Not Evaluating#
Every unscored agent output carries a cost — in trust, in engineering hours, and in liability. This is the eval tax: the compounding price teams pay for operating agents without systematic evaluation. The longer an agent runs unevaluated, the higher the accumulated risk.
The Eval Tax
The compounding cost of every unscored agent output — in trust, engineering hours, and liability.
Silent Degradation#
Agent quality degrades without any code changes. Upstream model providers update their models without notice — a study by Stanford and Berkeley found that the share of GPT-4's generated code that was directly executable dropped from 52% to 10% between March and June 2023, with no changelog entry. One study found that 91% of machine learning models experience performance degradation over time.
This phenomenon is eval drift: the silent erosion of output quality driven by provider updates, shifting input distributions, and environmental changes. Without continuous evaluation, drift goes undetected until users report failures.
Eval Drift
Silent degradation of agent output quality over time — even when your code and prompts haven't changed.
The Production Gap#
The distance between “the demo works” and “production works” is the eval gap. Demos operate on curated inputs with forgiving audiences. Production operates on adversarial inputs at scale with real consequences. Agent eval is the bridge between the two.
The Eval Gap
The distance between “demo works” and “production works” — driven by input distribution, compound failure, and cost reality.
Where Agent Eval Fits: The Three-Layer Model#
Agent systems require three distinct layers of observability. Most teams have infrastructure monitoring. Emerging teams add protocol-level tracing. The layer that almost everyone is missing is output quality evaluation.
- Layer 3 — Output Quality: Was the output correct? Quality scoring, safety checks, cost thresholds.
- Layer 2 — Protocol: What did the agent do? MCP call completion, tool invocations, message routing.
- Layer 1 — Infrastructure: Did the request succeed? Uptime, latency, error rates, resource utilization.
Each layer answers a different question. Infrastructure monitoring tells you the server is up. Protocol monitoring tells you the MCP calls completed. Agent eval tells you whether the response was actually correct, safe, and worth what it cost. These layers are complementary — you need all three.
Agent Eval Methodologies#
There are two fundamental approaches to evaluating agent outputs, each with distinct tradeoffs. Most production systems benefit from combining both.
Heuristic Eval#
Heuristic evaluation uses deterministic, pattern-based rules to score agent outputs. These rules are fast (sub-millisecond), free (no API calls), and perfectly consistent — the same input always produces the same score.
Common heuristic eval rules include:
- PII detection — regex patterns for SSN, credit card, phone, email
- Prompt injection detection — patterns that indicate the agent is being manipulated
- Output length thresholds — responses that are suspiciously short or long
- Cost limits — token usage and execution cost exceeding budgets
- Blocklist enforcement — prohibited words, competitor names, harmful content
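As a sketch, several of these rules can be expressed in a handful of lines of Python. The patterns, thresholds, and blocklist below are simplified illustrations, not production-grade detectors:

```python
import re

# Illustrative rule set; real PII patterns need far more hardening.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}
BLOCKLIST = {"acme corp"}  # prohibited terms, competitor names

def heuristic_eval(output: str, cost_usd: float, budget_usd: float = 0.05) -> list[str]:
    """Return the list of rule violations; an empty list means pass."""
    violations = []
    for name, pattern in PII_PATTERNS.items():
        if pattern.search(output):
            violations.append(f"pii:{name}")
    if not 10 <= len(output) <= 4000:  # suspiciously short or long
        violations.append("length")
    if any(term in output.lower() for term in BLOCKLIST):
        violations.append("blocklist")
    if cost_usd > budget_usd:
        violations.append("cost")
    return violations
```

Because every check is deterministic, the same output always produces the same violation list, which is what makes heuristic rules suitable as hard gates.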
Semantic Eval (LLM-as-Judge)#
Semantic evaluation uses a language model to assess the meaning and quality of agent outputs. An LLM scores responses against a rubric — checking for factual accuracy, coherence, helpfulness, and relevance.
The tradeoffs are real: semantic eval is slower (seconds, not milliseconds), costs money (each evaluation is an API call), and introduces non-determinism (the judge model can produce different scores for the same input). It also creates a recursive trust problem: who evaluates the evaluator?
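A minimal sketch of the judge pattern, with the model call itself left out: build a rubric prompt, then parse the judge's JSON reply into a verdict. The rubric wording and the 3.5 pass threshold are illustrative assumptions:

```python
import json

RUBRIC = """Score the RESPONSE to the REQUEST from 1 (poor) to 5 (excellent)
on each criterion: accuracy, coherence, helpfulness, relevance.
Reply with JSON only, e.g. {"accuracy": 4, "coherence": 5, "helpfulness": 4, "relevance": 5}."""

def build_judge_prompt(request: str, response: str) -> str:
    """Assemble the prompt sent to the judge model (call not shown)."""
    return f"{RUBRIC}\n\nREQUEST:\n{request}\n\nRESPONSE:\n{response}"

def parse_judge_reply(reply: str, threshold: float = 3.5) -> dict:
    """Parse the judge model's JSON reply into a pass/fail verdict."""
    scores = json.loads(reply)
    mean = sum(scores.values()) / len(scores)
    return {"scores": scores, "mean": mean, "passed": mean >= threshold}
```

Constraining the judge to structured JSON output is one common way to reduce (though not eliminate) the non-determinism of the scores.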
Hybrid Approach#
The most effective production systems use both. Heuristic rules handle safety gates (PII, injection, cost) — these are non-negotiable, binary checks that must run on every execution at zero latency. Semantic eval handles quality assessment for a subset of outputs where subjective judgment matters. This layered approach gives you full coverage without the cost of running LLM-as-Judge on every request.
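The layering can be sketched as a short pipeline. Both checks below are stubs (a length-only gate and an always-pass judge) standing in for real implementations; the shape of the control flow is the point:

```python
import random

def heuristic_gate(output: str) -> bool:
    """Cheap, deterministic safety gate (stub: length check only)."""
    return 10 <= len(output) <= 4000

def semantic_judge(output: str) -> bool:
    """Expensive LLM-as-Judge call (stubbed here as always-pass)."""
    return True

def hybrid_eval(output: str, sample_rate: float = 0.05) -> dict:
    if not heuristic_gate(output):
        return {"passed": False, "reason": "heuristic"}  # hard fail, zero cost
    if random.random() < sample_rate:                    # sampled subset
        return {"passed": semantic_judge(output), "reason": "semantic"}
    return {"passed": True, "reason": "gates-only"}
```

Every execution pays the near-zero heuristic cost; only the sampled fraction pays the LLM-as-Judge cost, which is what keeps the approach affordable at 100% coverage of the gates.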
For a deep comparison of these approaches, see How to Evaluate Agent Output Without Calling Another LLM.
What to Evaluate: The Four Categories#
Agent eval rules typically fall into four categories. A comprehensive evaluation system covers all four, targeting 100% eval coverage — scoring every execution, not sampling.
- Completeness: Did the response address the full request? Topic consistency, expected elements, response coverage.
- Relevance: Is the response appropriate? Language match, format validation, sentiment alignment.
- Safety: Is the response safe to deliver? PII detection, hallucination markers, injection patterns, blocklist.
- Cost: Was the response worth the resources? Token usage, execution cost, latency thresholds.
Eval Coverage
The percentage of agent executions that receive evaluation. The target is 100% — every output scored, not sampled.
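One way to organize such a rule set is a registry keyed by the four categories, plus a coverage metric. The rule names below are hypothetical illustrations, not any particular product's schema:

```python
# Hypothetical rule registry spanning all four categories.
EVAL_RULES = {
    "completeness": ["topic_consistency", "expected_elements", "response_coverage"],
    "relevance":    ["language_match", "format_validation", "sentiment_alignment"],
    "safety":       ["pii_detection", "hallucination_markers", "injection_patterns", "blocklist"],
    "cost":         ["token_usage", "execution_cost", "latency_threshold"],
}

def coverage(executions_scored: int, executions_total: int) -> float:
    """Eval coverage: the share of executions that received a score."""
    return executions_scored / executions_total if executions_total else 0.0
```

Tracking coverage as a first-class metric makes the gap between "we evaluate" and "we evaluate everything" visible.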
Eval-Driven Development#
Test-driven development (TDD) changed how software teams write code: define the test first, then write the implementation. Research from IBM and Microsoft found that TDD reduces production defects by 40-90%.
Eval-driven development (EDD) applies the same discipline to AI agents. Define what “correct” looks like before writing the prompt. The workflow:
- Define rules — specify eval criteria for quality, safety, and cost
- Write prompt — build the agent with rules as the quality contract
- Run eval — score real outputs against the rules
- Iterate — adjust prompts, tools, or architecture based on scores
- Lock rules — once passing, rules become the regression suite
Eval-Driven Development (EDD)
Define evaluation rules before writing agent prompts — the same way TDD defines tests before writing code.
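The workflow can be sketched as a loop that scores an agent against pre-defined rules and locks them once a target pass rate is met. The 95% target and the rule/case shapes are illustrative assumptions:

```python
# Hypothetical EDD cycle: rules are written before the prompt and act as
# the quality contract the agent must satisfy.
def run_edd_cycle(agent, cases, rules, pass_rate_target=0.95):
    """Score the agent against pre-defined rules; the rule set becomes
    the regression suite once the target pass rate is reached."""
    passed = sum(
        all(rule(agent(case)) for rule in rules)  # every rule must hold
        for case in cases
    )
    pass_rate = passed / len(cases)
    return {"pass_rate": pass_rate, "locked": pass_rate >= pass_rate_target}
```

As in TDD, the locked rules then run against every future prompt or model change, catching regressions before users do.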
The Eval Loop: Continuous Quality#
Agent evaluation is not a one-time event. Models change, inputs drift, and thresholds that made sense last month may not make sense today. The eval loop is the continuous cycle that keeps evaluation calibrated: score outputs, watch the score distributions, recalibrate rules and thresholds, and repeat.
Advanced systems implement self-calibrating evaluation: the system monitors its own scoring distributions and recommends threshold adjustments when the environment shifts. This closes the loop automatically, reducing the manual burden of maintaining eval rules over time.
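One simple form of self-calibration is a statistical check on the scoring distributions: flag drift when recent scores fall well below the baseline. A sketch using a z-score heuristic, where the 2-sigma threshold is an assumption and production systems would use more robust statistics:

```python
from statistics import mean, stdev

def detect_score_drift(baseline: list[float], recent: list[float],
                       z_threshold: float = 2.0) -> dict:
    """Flag drift when the recent mean score falls more than z_threshold
    baseline standard deviations below the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    shift = (mean(recent) - mu) / sigma if sigma else 0.0
    return {"z_shift": round(shift, 2), "drifted": shift < -z_threshold}
```

A drift flag like this can trigger a threshold review or an alert, closing the loop without waiting for user complaints.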
Getting Started#
Iris is the agent eval standard for MCP. Any MCP-compatible agent can discover and use it automatically — no SDK, no code changes. Add it to your MCP configuration:
```json
{
  "mcpServers": {
    "iris": {
      "command": "npx",
      "args": ["-y", "@iris-eval/mcp-server"]
    }
  }
}
```

Iris provides three MCP tools: log_trace for recording agent executions, evaluate_output for scoring outputs against 12 built-in eval rules, and get_traces for querying historical data.
References#
- Stanford AI Index Report 2025 — AI safety incident trends, global adoption metrics
- Chen, L. et al. (2023) “How Is ChatGPT's Behavior Changing over Time?” — Stanford/Berkeley GPT-4 drift study (arXiv:2307.09009)
- IBM/Microsoft joint study — Test-driven development reduces production defects 40-90%
- Gartner (2025) — 40% of agentic AI projects projected to be canceled by 2027
- LangChain State of AI Agents (2025) — Production evaluation adoption rates
- EU AI Act — Article 14 human oversight requirements, effective August 2026
Last updated March 2026. This guide is maintained by the Iris team and updated as the agent eval landscape evolves.