Iris v0.4 — LLM-as-Judge + citation verify + OTel + 9 MCP tools


Iris vs DeepEval

Inline production eval vs offline test suites: two different philosophies for evaluating AI agent output.

TL;DR

Iris is an MCP server that scores every agent output inline in production — zero code changes, deterministic rules, sub-millisecond overhead. DeepEval is a Python testing framework that runs LLM-as-Judge evaluation suites via pytest — powerful semantic metrics, designed for CI/CD pipelines. If you want every production output scored automatically, Iris. If you want deep semantic evaluation in your test pipeline, DeepEval.

For background on heuristic vs semantic evaluation, see our evaluation methodology guide.

Feature Comparison

Side by side.

| Feature | Iris | DeepEval |
| --- | --- | --- |
| Eval approach | Dual: deterministic rules (<1ms, free) + LLM-as-Judge (v0.4, 5 templates, cost-capped) | LLM-as-Judge metrics (semantic, slower) |
| Integration method | MCP config (zero code) | Python pytest decorators |
| When eval runs | Inline, every output in production | Offline, batch test suites in CI/CD |
| Language | TypeScript (any MCP agent) | Python only |
| Self-hosting | Single binary, one SQLite file | pip install, local execution |
| Built-in metrics | 13 deterministic rules + 5 LLM-judge templates (accuracy, helpfulness, safety, correctness, faithfulness) + semantic citation verification | 14+ metrics (faithfulness, hallucination, bias, toxicity) |
| Citation verification | SSRF-guarded source fetch + per-claim LLM verdict (v0.4) | Not included |
| Custom metrics | Zod-schema custom rules + programmatic MCP deploy_rule | Python custom metrics class |
| Cost tracking | Per-trace USD cost + per-LLM-judge-eval cost + aggregate visibility | Not included |
| Dashboard | Real-time dark-mode UI with Decision Moments + drift detection | Confident AI cloud dashboard (separate product) |
| MCP support | Protocol-native (IS an MCP server; 9 tools) | Not MCP-aware |
| OpenTelemetry export | OTLP/HTTP JSON to Jaeger/Tempo/Datadog (v0.4) | Not included |
| Supply-chain integrity | SBOM + cosign + SLSA build provenance (v0.4) | Standard pip |
| License | MIT | Apache 2.0 |
| Maturity | Early stage (v0.4.0) | Established (14K+ GitHub stars) |
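To make "deterministic rules (<1ms, free)" concrete: a rule in this tier is just a pure string check with no LLM call, so it can run inline on every output. The `Rule` shape and the two rules below are hypothetical illustrations, not Iris's actual interface or built-ins:

```typescript
// Hypothetical shape of a deterministic rule: a pure check, no LLM call.
interface Rule {
  id: string;
  check: (output: string) => boolean; // true = pass
}

// Two illustrative rules in the spirit of inline heuristics (names invented).
const rules: Rule[] = [
  { id: "non-empty", check: (o) => o.trim().length > 0 },
  { id: "no-raw-api-key", check: (o) => !/sk-[A-Za-z0-9]{20,}/.test(o) },
];

// Score one agent output inline; returns the ids of any failed rules.
function evaluate(output: string): string[] {
  return rules.filter((r) => !r.check(output)).map((r) => r.id);
}
```

Because each check is a plain predicate over the output string, the whole pass is synchronous and costs microseconds, which is what makes scoring every production output viable.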

Decision Guide

Which one fits your workflow?

When to choose Iris

  • You want eval running on every output in production, not just in test suites
  • You're building with MCP-compatible agents and want zero-code integration
  • You need cost tracking and aggregate spend visibility across agents
  • You want a single self-hosted binary with no Python dependency
  • You want both deterministic (fast, free) and LLM-as-Judge (semantic, cost-capped) in the same tool surface
  • You need semantic citation verification — checking whether cited sources actually support the claim
  • You want OpenTelemetry export to your existing Jaeger / Tempo / Datadog stack
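The "SSRF-guarded source fetch" in citation verification is worth unpacking: before fetching a cited URL, the verifier must refuse anything that could reach internal infrastructure. A minimal sketch of such a guard (the `isFetchableUrl` helper is hypothetical, not Iris's actual implementation) might look like:

```typescript
// Hypothetical SSRF pre-check: allow only http(s) URLs that do not point
// at loopback, link-local, or RFC 1918 private address space.
function isFetchableUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable URL at all
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname;
  if (host === "localhost" || host === "127.0.0.1" || host === "[::1]") return false;
  // Private / link-local ranges by literal prefix. A production guard would
  // also resolve DNS and re-check the resulting IP to stop rebinding tricks.
  if (/^(10\.|192\.168\.|169\.254\.|172\.(1[6-9]|2\d|3[01])\.)/.test(host)) return false;
  return true;
}
```

Only URLs that pass a check like this would be fetched and handed to the per-claim LLM verdict step.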

When to choose DeepEval

  • You're building in Python and want pytest-native eval workflows
  • You want to run eval suites in CI/CD pipelines before deployment
  • You need a mature ecosystem with extensive documentation and community
  • You want the Confident AI cloud platform for team collaboration

Last verified: March 2026. This comparison is based on publicly available documentation and may not reflect recent changes to DeepEval. We aim to keep this page accurate and fair.

See something outdated or incorrect? Report an inaccuracy — we review and update within 48 hours.

Ready to see what your agents are doing?

Add Iris to your MCP config. First trace in 60 seconds. No SDK, no signup, no infrastructure.
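The exact keys depend on your MCP client; for a Claude-Desktop-style `mcpServers` block the entry might look like the sketch below. The `npx` invocation and the package name `iris-mcp` are illustrative assumptions, not confirmed by this page — check the Iris install docs for the real command:

```json
{
  "mcpServers": {
    "iris": {
      "command": "npx",
      "args": ["-y", "iris-mcp"]
    }
  }
}
```

Once the client restarts, every agent output routed through the server is scored with no application code changes.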