See what your AI agents are actually doing
The agent eval standard for MCP. Install once. Every agent auto-discovers it. Zero SDK. Decision Moments classify what matters, so safety violations and cost spikes surface before happy-path passes.
Works with any MCP-compatible agent
The Problem
Your agents pass every health check.
Infrastructure monitoring tells you the request succeeded. It cannot tell you the answer was wrong. Your agents need a quality gate — something that scores every output for safety, accuracy, and cost before it reaches a user.
Product
Nine tools. One quality standard.
Iris registers as an MCP server. Your agent discovers it and invokes its tools automatically. No SDK. No code changes.
Every execution. Every tool call. Every token.
log_trace captures full agent runs with hierarchical spans, per-tool-call latency, token usage, and cost in USD.
- Hierarchical span tree with OpenTelemetry-compatible span kinds
- Per-tool-call latency tracking
- Token usage breakdown (prompt, completion, total)
- Arbitrary metadata for custom attribution
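The bullets above can be sketched as a trace payload. This is an illustrative shape only — the field names (`spans`, `tokens`, `cost_usd`) are assumptions for the sketch, not Iris's documented schema:

```python
# Hypothetical trace payload illustrating the data log_trace captures.
# Field names are illustrative, not Iris's actual schema.
trace = {
    "trace_id": "tr_001",
    "spans": [
        {"name": "agent.run", "kind": "SERVER", "latency_ms": 1840,
         "children": [
             {"name": "tool.search", "kind": "CLIENT", "latency_ms": 620, "children": []},
             {"name": "llm.complete", "kind": "CLIENT", "latency_ms": 1100, "children": []},
         ]},
    ],
    "tokens": {"prompt": 1200, "completion": 350},
    "cost_usd": 0.0132,
    "metadata": {"user_id": "u_42"},  # arbitrary metadata for custom attribution
}

def total_tokens(t: dict) -> int:
    """Prompt + completion, matching the 'total' in the token breakdown."""
    return t["tokens"]["prompt"] + t["tokens"]["completion"]

def span_count(span: dict) -> int:
    """Walk the hierarchical span tree."""
    return 1 + sum(span_count(c) for c in span["children"])
```

The span `kind` values mirror OpenTelemetry's span kinds, which is what makes the tree export-compatible.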
Built for
Three problems. One MCP server.
Every team building AI agents hits the same walls. Iris was built to tear them down — without touching your code.
“You deployed an agent and you have no idea what it's doing.”
Iris traces every execution, tool call, and token automatically. No SDK. No code changes. Add it to your MCP config and start seeing everything.
“Your agent burned $0.47 on a single query and your APM showed 200 OK.”
Iris tracks cost per trace, per agent, per time window. Set budget thresholds and get flagged when agents overspend — before finance finds out.
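A minimal sketch of that budget check — the function name and threshold value here are hypothetical, not Iris's API:

```python
def over_budget(trace_costs_usd: dict, threshold_usd: float) -> dict:
    """Return the traces whose cost exceeds the configured threshold.

    trace_costs_usd: mapping of trace_id -> cost in USD.
    """
    return {tid: c for tid, c in trace_costs_usd.items() if c > threshold_usd}

# The $0.47 query from the example above gets flagged at a $0.10 threshold.
costs = {"tr_001": 0.47, "tr_002": 0.03, "tr_003": 0.09}
flagged = over_budget(costs, threshold_usd=0.10)
```

The same check works per agent or per time window by changing what you aggregate over before comparing.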
“Your agent leaked a Social Security number in its output and nobody noticed for 3 months.”
Iris evaluates every output against 13 built-in rules including PII detection across 10 patterns (SSN, credit card, phone, email, IBAN, DOB, medical record number, IP address, API key, passport), prompt injection (13 patterns), stub-output detection, and hallucination markers. Real-time, every trace.
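To make the rule idea concrete, here is a toy version of pattern-based PII detection. These regexes are deliberately simplified illustrations — Iris's actual patterns are more thorough:

```python
import re

# Simplified PII patterns for illustration only; real-world detection
# needs far more robust patterns than these.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def detect_pii(output: str) -> list[str]:
    """Return the names of PII patterns found in an agent output."""
    return [name for name, rx in PII_PATTERNS.items() if rx.search(output)]
```

Because the check is pure pattern matching, it runs in real time on every trace without an extra LLM call.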
Join the community
Open Source — Free Forever to Self-Host
60 seconds to first trace.
Install Iris locally and start seeing what your agents are doing. Works with Claude Desktop, Cursor, Windsurf, or any MCP-compatible agent. Free, MIT-licensed, your data stays on your machine.
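For Claude Desktop, registering an MCP server is one entry in `claude_desktop_config.json`. The package name below (`iris-mcp`) is an assumption for the sketch — check the project's README for the real one:

```json
{
  "mcpServers": {
    "iris": {
      "command": "npx",
      "args": ["-y", "iris-mcp"]
    }
  }
}
```

Because MCP clients auto-discover a server's tools on connect, this one config change is the whole integration — no SDK, no code changes.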
Pricing
Free to self-host. Cloud when you're ready.
The open-source core is MIT licensed with no limits. The cloud adds team dashboards, alerting, and managed infrastructure — starting free.
Self-Hosted
Everything you need to evaluate your MCP agents in production. Your machine, your data, your eval rules.
- 9 MCP tools — full lifecycle + LLM judge + semantic citation verify (SSRF-guarded)
- LLM-as-judge + citation verify use your own Anthropic/OpenAI API key (BYOK, no proxy)
- 13 built-in eval rules + custom rules
- Web dashboard with trace visualization
- SQLite storage — zero infrastructure
- Production security (auth, rate limiting)
- Cost tracking per trace
- Docker + npm + npx install
- Community support (GitHub + Discord)
Cloud Starter
Run evaluations in the cloud with no commitment. Same eval engine, managed for you. No credit card.
- Everything in Self-Hosted, plus:
- 10,000 evaluations / month
- 7-day eval history
- 1 team member
- Managed PostgreSQL
- Personal dashboard
- No credit card required
Cloud Pro
For teams that need shared eval results, alerting on quality regressions, and room to scale.
- Everything in Starter, plus:
- 25,000 evaluations included
- $0.005 per additional evaluation
- 90-day eval history
- Unlimited team members
- Team dashboards with shared views
- Alerting (webhook + email)
- API key management
- CSV / JSON data export
- Priority support
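The Pro overage pricing works out as a simple quota-plus-rate calculation (the quota and rate come straight from the list above):

```python
INCLUDED = 25_000     # evaluations included in Cloud Pro
OVERAGE_RATE = 0.005  # USD per additional evaluation

def overage_cost(evals_used: int) -> float:
    """Monthly cost beyond the included quota, in USD."""
    return max(0, evals_used - INCLUDED) * OVERAGE_RATE

# Example: 40,000 evaluations in a month -> 15,000 over quota -> $75.00
```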
Enterprise
For organizations that need audit-grade evaluation records, compliance, and dedicated support.
- Everything in Pro, plus:
- SSO / SAML (Okta, Azure AD, Google)
- RBAC with custom roles
- Audit logs with export
- SOC 2 Type II documentation
- Custom retention policies
- SLA with uptime guarantee
- Dedicated support + onboarding
- EU AI Act compliance support
All plans include unlimited eval rules, both transports (stdio + HTTP), and full API access.
Waitlist members get founding-member pricing and a direct line to shape the roadmap.
Get early access to Iris Cloud
No spam. One email when the cloud tier launches.
“I kept running into the same problem building AI agents: once they're running, you have no visibility into what they're actually doing. Traditional monitoring tells you the request succeeded. It can't tell you the agent leaked PII, hallucinated an answer, or burned through your budget on a single query.
So I built Iris — an MCP server that any agent discovers and uses automatically. No SDK. No code changes. Just add it to your config and start seeing everything.”
Research
Publications and insights.
Original research on MCP agent observability, evaluation methodology, and the evolving landscape of AI agent infrastructure.
The State of MCP Agent Observability
The gap between deploying AI agents and understanding what they're doing. Covers protocol-native observability, heuristic vs. semantic eval, cost visibility, and EU AI Act implications.
Read report
Why Your AI Agents Need Observability
AI agents fail silently. Traditional monitoring can't see the difference between a correct response and a hallucinated one. Why protocol-native observability changes the equation.
Read post
MCP Agent Observability Survey 2026
We're collecting data on how teams evaluate, monitor, and track costs for AI agents in production.
Roadmap
Built in public. Shipping fast.
Core MCP Server
3 tools, initial 12-rule library, SQLite storage, web dashboard, production security
Eval Sensitivity + Security Hardening
Smart rule exclusion, configurable thresholds, SQL whitelist, CSP headers, accessibility
Dashboard Phase-1 + Pricing
OKLCH palette, dark/light theme, trace-ID copy, eval sparkline, pricing page, MCP-native validation harness
Rule Library Expansion
13 eval rules (added no_stub_output), 10 PII patterns (IBAN, DOB, MRN, IP, API key, passport), 13 injection patterns, fabricated-citation heuristic, 55-case CI regression gate
LLM-as-Judge + Citation Verify + OTel + 9-tool MCP Surface
9 MCP tools — full rule + trace lifecycle + LLM-as-judge + SSRF-guarded citation verification (list_rules, deploy_rule, delete_rule, delete_trace, evaluate_with_llm_judge, verify_citations added); LLM-as-judge eval (Claude/GPT-4o, cost-capped, 5 prompt templates); semantic citation verification (4 citation kinds — numbered/author-year/URL/DOI — SSRF-guarded fetch + per-claim LLM verdict); OpenTelemetry export; tenant-id scaffolding; SBOM + cosign signing; Playwright E2E; Lighthouse CI; v2.C chrome polish
Cloud Tier
Managed Iris — PostgreSQL adapter, full multi-tenancy with user accounts + workspace isolation, team eval dashboards, usage-based billing
Alerting & Retention
Alert rules, webhooks, email notifications, retention policies, drift detection
Enterprise
SSO/SAML, RBAC, audit logs export, SOC 2 compliance