Iris v0.4 — LLM-as-Judge + citation verify + OTel + 9 MCP tools


Iris vs DeepEval

Inline production eval vs offline test suites: two different philosophies for evaluating AI agent output.

TL;DR

Iris is an MCP server that scores every agent output inline in production — zero code changes, deterministic rules, sub-millisecond overhead. DeepEval is a Python testing framework that runs LLM-as-Judge evaluation suites via pytest — powerful semantic metrics, designed for CI/CD pipelines. If you want every production output scored automatically, Iris. If you want deep semantic evaluation in your test pipeline, DeepEval.

For background on heuristic vs semantic evaluation, see our evaluation methodology guide.

Feature Comparison

Side by side.

| Feature | Iris | DeepEval |
| --- | --- | --- |
| Eval approach | Dual: deterministic rules (<1ms, free) + LLM-as-Judge (v0.4, 5 templates, cost-capped) | LLM-as-Judge metrics (semantic, slower) |
| Integration method | MCP config (zero code) | Python pytest decorators |
| When eval runs | Inline, every output in production | Offline, batch test suites in CI/CD |
| Language | TypeScript (any MCP agent) | Python only |
| Self-hosting | Single binary, one SQLite file | pip install, local execution |
| Built-in metrics | 13 deterministic rules + 5 LLM-judge templates (accuracy, helpfulness, safety, correctness, faithfulness) + semantic citation verification | 14+ metrics (faithfulness, hallucination, bias, toxicity) |
| Citation verification | SSRF-guarded source fetch + per-claim LLM verdict (v0.4) | Not included |
| Custom metrics | Zod-schema custom rules + programmatic MCP deploy_rule | Python custom metrics class |
| Cost tracking | Per-trace USD cost + per-LLM-judge-eval cost + aggregate visibility | Not included |
| Dashboard | Real-time dark-mode UI with Decision Moments + drift detection | Confident AI cloud dashboard (separate product) |
| MCP support | Protocol-native (IS an MCP server; 9 tools) | Not MCP-aware |
| OpenTelemetry export | OTLP/HTTP JSON to Jaeger/Tempo/Datadog (v0.4) | Not included |
| Supply-chain integrity | SBOM + cosign + SLSA build provenance (v0.4) | Standard pip |
| License | MIT | Apache 2.0 |
| Maturity | Early stage (v0.4.0) | Established (14K+ GitHub stars) |
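To make "deterministic rules (<1ms, free)" concrete: a rule in this tier is just a pure string check with no LLM call, so it can run inline on every output. The `Rule` shape and the two rules below are hypothetical illustrations, not Iris's actual interface or built-ins:

```typescript
// Hypothetical shape of a deterministic rule: a pure check, no LLM call.
interface Rule {
  id: string;
  check: (output: string) => boolean; // true = pass
}

// Two illustrative rules in the spirit of inline heuristics (names invented).
const rules: Rule[] = [
  { id: "non-empty", check: (o) => o.trim().length > 0 },
  { id: "no-raw-api-key", check: (o) => !/sk-[A-Za-z0-9]{20,}/.test(o) },
];

// Score one agent output inline; returns the ids of any failed rules.
function evaluate(output: string): string[] {
  return rules.filter((r) => !r.check(output)).map((r) => r.id);
}
```

Because each check is a plain predicate over the output string, the whole pass is synchronous and costs microseconds, which is what makes scoring every production output viable.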

Decision Guide

Which one fits your workflow?

When to choose Iris

  • You want eval running on every output in production, not just in test suites
  • You're building with MCP-compatible agents and want zero-code integration
  • You need cost tracking and aggregate spend visibility across agents
  • You want a single self-hosted binary with no Python dependency
  • You want both deterministic (fast, free) and LLM-as-Judge (semantic, cost-capped) in the same tool surface
  • You need semantic citation verification — checking whether cited sources actually support the claim
  • You want OpenTelemetry export to your existing Jaeger / Tempo / Datadog stack
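The "SSRF-guarded source fetch" in citation verification is worth unpacking: before fetching a cited URL, the verifier must refuse anything that could reach internal infrastructure. A minimal sketch of such a guard (the `isFetchableUrl` helper is hypothetical, not Iris's actual implementation) might look like:

```typescript
// Hypothetical SSRF pre-check: allow only http(s) URLs that do not point
// at loopback, link-local, or RFC 1918 private address space.
function isFetchableUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable URL at all
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname;
  if (host === "localhost" || host === "127.0.0.1" || host === "[::1]") return false;
  // Private / link-local ranges by literal prefix. A production guard would
  // also resolve DNS and re-check the resulting IP to stop rebinding tricks.
  if (/^(10\.|192\.168\.|169\.254\.|172\.(1[6-9]|2\d|3[01])\.)/.test(host)) return false;
  return true;
}
```

Only URLs that pass a check like this would be fetched and handed to the per-claim LLM verdict step.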

When to choose DeepEval

  • You're building in Python and want pytest-native eval workflows
  • You want to run eval suites in CI/CD pipelines before deployment
  • You need a mature ecosystem with extensive documentation and community
  • You want the Confident AI cloud platform for team collaboration

Last verified: March 2026. This comparison is based on publicly available documentation and may not reflect recent changes to DeepEval. We aim to keep this page accurate and fair.

See something outdated or incorrect? Report an inaccuracy — we review and update within 48 hours.

Ready to see what your agents are doing?

Add Iris to your MCP config. First trace in 60 seconds. No SDK, no signup, no infrastructure.
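The exact keys depend on your MCP client; for a Claude-Desktop-style `mcpServers` block the entry might look like the sketch below. The `npx` invocation and the package name `iris-mcp` are illustrative assumptions, not confirmed by this page — check the Iris install docs for the real command:

```json
{
  "mcpServers": {
    "iris": {
      "command": "npx",
      "args": ["-y", "iris-mcp"]
    }
  }
}
```

Once the client restarts, every agent output routed through the server is scored with no application code changes.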