Eval-Driven Development
Write the rules before the prompt. TDD for AI agents.
Definition
Eval-driven development (EDD) is the practice of writing the eval rules that define a correct output before writing the agent prompt, then iterating on the prompt until outputs pass those rules consistently. It applies the test-first discipline of TDD to the non-deterministic outputs of AI agents.
The EDD Cycle
1. Define Rules. What does "correct" look like? Set thresholds for quality, safety, and cost.
2. Write Prompt. Build the agent prompt to meet the rules you defined.
3. Score Outputs. Run the agent; eval rules score every output automatically.
4. Iterate. Refine the prompt based on the scores. Repeat until the rules pass consistently.
EDD vs TDD
| Dimension | TDD | EDD |
|---|---|---|
| Assertion type | Exact match | Score threshold |
| Output model | Deterministic | Non-deterministic |
| Runs in prod? | No (CI only) | Yes (every output) |
| What you define first | Test cases | Eval rules |
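The first two rows of the table can be made concrete with a short, hypothetical sketch: a TDD assertion demands one exact answer, while an EDD rule accepts any output whose judge score clears a threshold.

```python
def tdd_assert(output: str) -> bool:
    # Deterministic code has one correct answer, so TDD uses exact match.
    return output == "4"

def edd_assert(score: float, threshold: float = 0.8) -> bool:
    # A non-deterministic agent has many acceptable answers, so EDD
    # asserts that a judge's score clears a threshold instead.
    return score >= threshold

assert tdd_assert("4")          # exactly one right output for "2+2"
assert not tdd_assert("four")   # a correct-in-spirit answer still fails TDD
assert edd_assert(0.92)         # a scored output passes above the threshold
assert not edd_assert(0.55)     # and fails below it
```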
How Iris Helps
Iris makes EDD practical. Define your eval rules, add Iris to your MCP config, and every agent output is scored against those rules automatically. The same rules that guide development continue running in production — no separate test harness needed.
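For context, MCP servers are typically registered in a JSON config file. The entry below is a sketch only: the server name and launch command shown for Iris are assumptions, not documented values, so check the Iris docs for the real entry.

```json
{
  "mcpServers": {
    "iris": {
      "command": "npx",
      "args": ["-y", "iris-mcp"]
    }
  }
}
```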
Read the deep dive: Eval-Driven Development →
Related Concepts
Eval Coverage
EDD naturally produces 100% coverage — rules run on every output.
The Eval Loop
EDD is the starting point. The eval loop is what happens next — continuous iteration.
The Eval Gap
EDD closes the gap by ensuring production uses the same rules as development.
Agent Eval
The complete guide to evaluating AI agent outputs.