Eval-Driven Development
Write the rules before the prompt. TDD for AI agents.
Definition
Eval-driven development (EDD) is the practice of writing the eval rules that define a correct output before writing the agent prompt, then iterating on the prompt until outputs pass those rules consistently. It applies the test-first discipline of TDD to the non-deterministic outputs of AI agents.
The EDD Cycle
1. Define Rules. What does "correct" look like? Set thresholds for quality, safety, and cost.
2. Write Prompt. Build the agent prompt to meet the rules you defined.
3. Score Outputs. Run the agent; eval rules score every output automatically.
4. Iterate. Refine the prompt based on the scores. Repeat until the rules pass consistently.
EDD vs TDD
| Dimension | TDD | EDD |
|---|---|---|
| Assertion type | Exact match | Score threshold |
| Output model | Deterministic | Non-deterministic |
| Runs in prod? | No (CI only) | Yes (every output) |
| What you define first | Test cases | Eval rules |
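The first two rows of the table can be made concrete with a short, hypothetical sketch: a TDD assertion demands one exact answer, while an EDD rule accepts any output whose judge score clears a threshold.

```python
def tdd_assert(output: str) -> bool:
    # Deterministic code has one correct answer, so TDD uses exact match.
    return output == "4"

def edd_assert(score: float, threshold: float = 0.8) -> bool:
    # A non-deterministic agent has many acceptable answers, so EDD
    # asserts that a judge's score clears a threshold instead.
    return score >= threshold

assert tdd_assert("4")          # exactly one right output for "2+2"
assert not tdd_assert("four")   # a correct-in-spirit answer still fails TDD
assert edd_assert(0.92)         # a scored output passes above the threshold
assert not edd_assert(0.55)     # and fails below it
```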
How Iris Helps
Iris makes EDD practical. Define your eval rules, add Iris to your MCP config, and every agent output is scored against those rules automatically. The same rules that guide development continue running in production — no separate test harness needed.
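For context, MCP servers are typically registered in a JSON config file. The entry below is a sketch only: the server name and launch command shown for Iris are assumptions, not documented values, so check the Iris docs for the real entry.

```json
{
  "mcpServers": {
    "iris": {
      "command": "npx",
      "args": ["-y", "iris-mcp"]
    }
  }
}
```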
Read the deep dive: Eval-Driven Development →
Related Concepts
Eval Coverage
EDD naturally produces 100% coverage — rules run on every output.
The Eval Loop
EDD is the starting point. The eval loop is what happens next — continuous iteration.
The Eval Gap
EDD closes the gap by ensuring production uses the same rules as development.
Agent Eval
The complete guide to evaluating AI agent outputs.