# Iris vs DeepEval
Inline production evaluation vs. offline test suites: two different philosophies for evaluating AI agent output.
## TL;DR

**Iris** evaluates every agent output inline in production, pairing deterministic rules (<1ms, free) with cost-capped LLM-as-Judge templates, and integrates via MCP config with no code changes. **DeepEval** runs LLM-as-Judge metrics offline in Python test suites, typically in CI/CD. For background on heuristic vs. semantic evaluation, see our evaluation methodology guide.
## Feature Comparison
| Feature | Iris | DeepEval |
|---|---|---|
| Eval approach | Dual: deterministic rules (<1ms, free) + LLM-as-Judge (v0.4, 5 templates, cost-capped) | LLM-as-Judge metrics (semantic, slower) |
| Integration method | MCP config (zero code) | Python pytest decorators |
| When eval runs | Inline, every output in production | Offline, batch test suites in CI/CD |
| Language | TypeScript (any MCP agent) | Python only |
| Self-hosting | Single binary, one SQLite file | pip install, local execution |
| Built-in metrics | 13 deterministic rules + 5 LLM-judge templates (accuracy, helpfulness, safety, correctness, faithfulness) + semantic citation verification | 14+ metrics (faithfulness, hallucination, bias, toxicity) |
| Citation verification | SSRF-guarded source fetch + per-claim LLM verdict (v0.4) | Not included |
| Custom metrics | Zod schema custom rules + programmatic MCP deploy_rule | Python custom metrics class |
| Cost tracking | Per-trace USD cost + per-LLM-judge-eval cost + aggregate visibility | Not included |
| Dashboard | Real-time dark-mode UI with Decision Moments + drift detection | Confident AI cloud dashboard (separate product) |
| MCP support | MCP-native (Iris is itself an MCP server exposing 9 tools) | Not MCP-aware |
| OpenTelemetry export | OTLP/HTTP JSON to Jaeger/Tempo/Datadog (v0.4) | Not included |
| Supply-chain integrity | SBOM + cosign + SLSA build-provenance (v0.4) | Standard pip |
| License | MIT | Apache 2.0 |
| Maturity | Early stage (v0.4.0) | Established (14K+ GitHub stars) |
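To make the "dual eval" row in the table concrete, here is a sketch of what a deterministic rule looks like in spirit: a pure function that checks an output against a cheap heuristic in well under a millisecond, with no model call. This is an illustrative example, not Iris's actual rule API; the function name and result shape are invented for this sketch.

```typescript
// Hypothetical deterministic rule (not Iris's real API): flag agent output
// that contains citation markers like [1] but no source list.
interface RuleResult {
  pass: boolean;
  reason: string;
}

function checkCitationsHaveSources(output: string): RuleResult {
  // Find numeric citation markers such as [1], [2], ...
  const markers = output.match(/\[\d+\]/g) ?? [];
  // Look for a line beginning with "Sources:" (case-insensitive)
  const hasSourceList = /^sources?:/im.test(output);
  if (markers.length > 0 && !hasSourceList) {
    return {
      pass: false,
      reason: `${markers.length} citation marker(s) but no source list`,
    };
  }
  return { pass: true, reason: "ok" };
}

console.log(checkCitationsHaveSources("Claim [1].\nSources:\n1. example.com").pass); // true
console.log(checkCitationsHaveSources("Claim [1]. No sources here.").pass); // false
```

Checks like this run on every output for free; semantic questions (is the claim actually supported by the cited source?) are what the LLM-as-Judge layer handles.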
## Decision Guide

Choose **Iris** if you want every production output evaluated inline, zero-code integration via MCP config, per-trace cost visibility, and a self-hosted single binary. Choose **DeepEval** if you work in Python, want an established library with a broad set of semantic metrics, and evaluate offline in batch test suites as part of CI/CD. The two approaches are complementary rather than mutually exclusive.
Last verified: March 2026. This comparison is based on publicly available documentation and may not reflect recent changes to DeepEval. We aim to keep this page accurate and fair.
See something outdated or incorrect? Report an inaccuracy — we review and update within 48 hours.
Add Iris to your MCP config. First trace in 60 seconds. No SDK, no signup, no infrastructure.
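As a rough illustration, an MCP client configuration entry might look like the following. The server key, command name, and arguments here are placeholders, not Iris's documented values; consult the Iris docs for the exact entry.

```json
{
  "mcpServers": {
    "iris": {
      "command": "iris",
      "args": ["serve"]
    }
  }
}
```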