Iris v0.4 — LLM-as-Judge + citation verify + OTel + 9 MCP tools

See what your AI agents are actually shipping.

The agent eval standard for MCP. Install once. Every agent auto-discovers it. Zero SDK. Decision Moments classify what matters, so safety violations and cost spikes surface before happy-path passes.

$ npx @iris-eval/mcp-server
13 rules · regression-protected CI
5 real-domain case studies · executed end-to-end through MCP
249 tests · all green
Iris Dashboard — localhost:3838
Agents: 5 · Traces: 1,247 · Cost (7d): $127.43

  • Total Traces: 1,247 (+12%)
  • Avg Score: 0.84 (+0.03)
  • Total Cost: $127.43 (+8%)
  • PII Alerts: 3 (-2)

Recent Traces (last 24 hours)
  • research-agent · pass · 0.94 · $0.12 · 2.3s · 5 tool calls
  • code-review-bot · pass · 0.87 · $0.04 · 1.1s · 3 tool calls
  • support-agent · fail · 0.32 · $0.47 · 4.8s · 7 tool calls
  • data-pipeline · pass · 0.91 · $0.08 · 0.6s · 2 tool calls
  • content-writer · warn · 0.62 · $0.21 · 3.4s · 4 tool calls

Works with any MCP-compatible agent

Claude Desktop · Cursor · Claude Code · Windsurf · LangChain · CrewAI · MCP SDK · AutoGen

The Problem

Your agents pass every health check.

Infrastructure monitoring tells you the request succeeded. It cannot tell you the answer was wrong. Your agents need a quality gate — something that scores every output for safety, accuracy, and cost before it reaches a user.

What your APM sees
  • Status: 200 OK
  • Latency: 143ms
  • Memory: 245 MB
  • CPU: 12%
  • Throughput: 847 req/min
  • Health: All systems operational

What Iris sees
  • PII Detected: SSN pattern in output (***-**-6789)
  • Injection Risk: prompt manipulation attempt detected
  • Cost: $0.47 / query (4.7x over $0.10 threshold)
  • Hallucination Markers: "As an AI language model" in output
  • Tool call #3 error: database_lookup timed out (30s)
  • Quality Score: 0.32 / 1.0 — FAIL

Product

Nine tools. One quality standard.

Iris registers as an MCP server. Your agent discovers it and invokes its tools automatically. No SDK. No code changes.

Every execution. Every tool call. Every token.

log_trace captures full agent runs with hierarchical spans, per-tool-call latency, token usage, and cost in USD.

  • Hierarchical span tree with OpenTelemetry-compatible span kinds
  • Per-tool-call latency tracking
  • Token usage breakdown (prompt, completion, total)
  • Arbitrary metadata for custom attribution
Span Tree
AGENT research-agent (2.3s)
├─ LLM  system_prompt (0.1s)
├─ TOOL web_search (0.8s)
├─ LLM  summarize_results (0.4s)
├─ TOOL database_query (0.3s)
└─ LLM  final_response (0.7s)
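A trace like the one above could be represented as a nested payload of spans with per-call latency, token usage, and cost. The sketch below is illustrative only — the field names and shapes are assumptions, not the documented `@iris-eval/mcp-server` schema for `log_trace`:

```typescript
// Hypothetical shape for a log_trace payload — field names are
// illustrative, not the actual @iris-eval/mcp-server schema.
interface Span {
  name: string;
  kind: "AGENT" | "LLM" | "TOOL"; // OpenTelemetry-style span kinds
  durationMs: number;
  children?: Span[];
}

const trace = {
  agent: "research-agent",
  spans: <Span>{
    name: "research-agent",
    kind: "AGENT",
    durationMs: 2300,
    children: [
      { name: "system_prompt", kind: "LLM", durationMs: 100 },
      { name: "web_search", kind: "TOOL", durationMs: 800 },
      { name: "summarize_results", kind: "LLM", durationMs: 400 },
      { name: "database_query", kind: "TOOL", durationMs: 300 },
      { name: "final_response", kind: "LLM", durationMs: 700 },
    ],
  },
  tokens: { prompt: 1850, completion: 420, total: 2270 }, // made-up counts
  costUsd: 0.12,
  metadata: { run_id: "abc-123" }, // arbitrary custom attribution
};

// Sanity check: child span time should not exceed the root span's duration.
const childMs = trace.spans.children!.reduce((sum, c) => sum + c.durationMs, 0);
console.log(childMs <= trace.spans.durationMs); // true
```

The hierarchical `children` array is what produces the span tree shown in the dashboard; the root AGENT span wraps the interleaved LLM and TOOL calls.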

Built for

Three problems. One MCP server.

Every team building AI agents hits the same walls. Iris was built to tear them down — without touching your code.

Developers shipping MCP agents

You deployed an agent and you have no idea what it's doing.

Iris traces every execution, tool call, and token automatically. No SDK. No code changes. Add it to your MCP config and start seeing everything.

60s to first trace
Teams monitoring agent costs

Your agent burned $0.47 on a single query and your APM showed 200 OK.

Iris tracks cost per trace, per agent, per time window. Set budget thresholds and get flagged when agents overspend — before finance finds out.

$0.07 avg cost visibility per trace
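Per-trace cost tracking of this kind reduces to multiplying token counts by per-token prices and comparing the result to a budget. A minimal sketch — the prices and threshold here are made-up illustrations, not Iris's internal values:

```typescript
// Minimal sketch of per-trace cost tracking with a budget threshold.
// Prices and threshold are illustrative assumptions, not Iris's internals.
const PRICE_PER_1K = { prompt: 0.003, completion: 0.015 }; // USD per 1K tokens
const BUDGET_USD = 0.10; // flag any trace that costs more than this

function traceCostUsd(promptTokens: number, completionTokens: number): number {
  return (
    (promptTokens / 1000) * PRICE_PER_1K.prompt +
    (completionTokens / 1000) * PRICE_PER_1K.completion
  );
}

// An unusually heavy trace: 100K prompt tokens, 12K completion tokens.
const cost = traceCostUsd(100_000, 12_000);
console.log(cost.toFixed(2));   // "0.48"
console.log(cost > BUDGET_USD); // true — this trace would be flagged
```

Aggregating the same per-trace figure by agent or by time window is what turns a single $0.47 surprise into a trend you can alert on.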
Companies preventing PII leaks

Your agent leaked a Social Security number in its output and nobody noticed for 3 months.

Iris evaluates every output against 13 built-in rules, including PII detection across 10 patterns (SSN, credit card, phone, email, IBAN, DOB, medical record number, IP address, API key, passport), prompt injection detection (13 patterns), stub-output detection, and hallucination markers. In real time, on every trace.

13 built-in eval rules
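Rule-based PII detection typically boils down to a set of regular expressions run over the model's output. A simplified sketch covering three of the pattern families named above — these regexes are illustrative, not Iris's production rules:

```typescript
// Simplified PII scan — illustrative regexes, not Iris's production rules.
const PII_PATTERNS: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,          // US Social Security number shape
  email: /\b[\w.+-]+@[\w-]+\.[\w.-]+\b/, // email address
  creditCard: /\b(?:\d[ -]?){13,16}\b/,  // loose 13-16 digit card shape
};

// Return the names of every PII pattern that matches the output.
function detectPii(output: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, re]) => re.test(output))
    .map(([name]) => name);
}

console.log(detectPii("Customer SSN is 123-45-6789, reach me at a@b.co"));
// detects "ssn" and "email"
```

Because the checks are plain regex scans, they are deterministic and fast enough to run on every trace, which is why heuristic rules like these can gate output in real time.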

Join the community

9 MCP tools: log, evaluate, query, deploy/delete rules, delete traces, LLM judge (BYOK), citation verify (BYOK)
13 built-in eval rules: completeness, relevance, safety, cost
Eval latency: heuristic rules, fast and deterministic
0 lines of code to integrate: add to MCP config and you're done

Open Source — Free Forever to Self-Host

60 seconds to first trace.

Install Iris locally and start seeing what your agents are doing. Works with Claude Desktop, Cursor, Windsurf, or any MCP-compatible agent. Free, MIT-licensed, your data stays on your machine.

claude_desktop_config.json
{
  "mcpServers": {
    "iris-eval": {
      "command": "npx",
      "args": ["@iris-eval/mcp-server"]
    }
  }
}
Terminal
$ npm install -g @iris-eval/mcp-server
$ iris-mcp --dashboard
✓ Dashboard running at http://localhost:3838
Cursor
Install Iris in Cursor

One-click install for Cursor IDE.
No config file needed.

Pricing

Free to self-host. Cloud when you're ready.

The open-source core is MIT licensed with no limits. The cloud adds team dashboards, alerting, and managed infrastructure — starting free.

Open Source

Self-Hosted

$0 forever

Everything you need to evaluate your MCP agents in production. Your machine, your data, your eval rules.

  • 9 MCP tools — full lifecycle + LLM judge + semantic citation verify (SSRF-guarded)
  • LLM-as-judge + citation verify use your own Anthropic/OpenAI API key (BYOK, no proxy)
  • 13 built-in eval rules + custom rules
  • Web dashboard with trace visualization
  • SQLite storage — zero infrastructure
  • Production security (auth, rate limiting)
  • Cost tracking per trace
  • Docker + npm + npx install
  • Community support (GitHub + Discord)
Free · Coming Soon

Cloud Starter

$0/month

Run evaluations in the cloud with no commitment. Same eval engine, managed for you. No credit card.

  • Everything in Self-Hosted, plus:
  • 10,000 evaluations / month
  • 7-day eval history
  • 1 team member
  • Managed PostgreSQL
  • Personal dashboard
  • No credit card required
Most Popular · Coming Soon

Cloud Pro

$49/month

For teams that need shared eval results, alerting on quality regressions, and room to scale.

  • Everything in Starter, plus:
  • 25,000 evaluations included
  • $0.005 per additional evaluation
  • 90-day eval history
  • Unlimited team members
  • Team dashboards with shared views
  • Alerting (webhook + email)
  • API key management
  • CSV / JSON data export
  • Priority support
Custom · Coming Soon

Enterprise

Custom

For organizations that need audit-grade evaluation records, compliance, and dedicated support.

  • Everything in Pro, plus:
  • SSO / SAML (Okta, Azure AD, Google)
  • RBAC with custom roles
  • Audit logs with export
  • SOC 2 Type II documentation
  • Custom retention policies
  • SLA with uptime guarantee
  • Dedicated support + onboarding
  • EU AI Act compliance support

All plans include unlimited eval rules, both transports (stdio + HTTP), and full API access.
Waitlist members get founding-member pricing and a direct line to shape the roadmap.

Get early access to Iris Cloud

No spam. One email when the cloud tier launches.

I kept running into the same problem building AI agents: once they're running, you have no visibility into what they're actually doing. Traditional monitoring tells you the request succeeded. It can't tell you the agent leaked PII, hallucinated an answer, or burned through your budget on a single query.

So I built Iris — an MCP server that any agent discovers and uses automatically. No SDK. No code changes. Just add it to your config and start seeing everything.

Ian Parent
Founder & Builder

Roadmap

Built in public. Shipping fast.

v0.1 · Released

Core MCP Server

3 tools, initial 12-rule library, SQLite storage, web dashboard, production security

v0.2 · Released

Eval Sensitivity + Security Hardening

Smart rule exclusion, configurable thresholds, SQL whitelist, CSP headers, accessibility

v0.3 · Released

Dashboard Phase-1 + Pricing

OKLCH palette, dark/light theme, trace-ID copy, eval sparkline, pricing page, MCP-native validation harness

v0.3.1 · Released

Rule Library Expansion

13 eval rules (added no_stub_output), 10 PII patterns (IBAN, DOB, MRN, IP, API key, passport), 13 injection patterns, fabricated-citation heuristic, 55-case CI regression gate

v0.4 · Planned

LLM-as-Judge + Citation Verify + OTel + 9-tool MCP Surface

  • 9 MCP tools covering the full rule + trace lifecycle (list_rules, deploy_rule, delete_rule, delete_trace, evaluate_with_llm_judge, verify_citations added)
  • LLM-as-judge eval (Claude/GPT-4o, cost-capped, 5 prompt templates)
  • Semantic citation verification: 4 citation kinds (numbered, author-year, URL, DOI), SSRF-guarded fetch, per-claim LLM verdict
  • OpenTelemetry export, tenant-id scaffolding
  • SBOM + cosign signing, Playwright E2E, Lighthouse CI, v2.C chrome polish

v0.5 · Planned

Cloud Tier

Managed Iris — PostgreSQL adapter, full multi-tenancy with user accounts + workspace isolation, team eval dashboards, usage-based billing

v0.6 · Planned

Alerting & Retention

Alert rules, webhooks, email notifications, retention policies, drift detection

v0.7 · Planned

Enterprise

SSO/SAML, RBAC, audit logs export, SOC 2 compliance