Iris v0.4 — LLM-as-Judge + citation verify + OTel + 9 MCP tools

See what your AI agents are actually shipping.

The agent eval standard for MCP. Install once. Every agent auto-discovers it. Zero SDK. Decision Moments classify what matters, so safety violations and cost spikes surface before happy-path passes.

$ npx @iris-eval/mcp-server
13 rules · regression-protected CI
5 real-domain case studies · executed end-to-end through MCP
249 tests · all green
Iris Dashboard — localhost:3838
Agents: 5 · Traces: 1,247 · Cost (7d): $127.43

  • Total Traces: 1,247 (+12%)
  • Avg Score: 0.84 (+0.03)
  • Total Cost: $127.43 (+8%)
  • PII Alerts: 3 (-2)

Recent Traces (last 24 hours)
  • research-agent · pass · 0.94 · $0.12 · 2.3s · 5 tool calls
  • code-review-bot · pass · 0.87 · $0.04 · 1.1s · 3 tool calls
  • support-agent · fail · 0.32 · $0.47 · 4.8s · 7 tool calls
  • data-pipeline · pass · 0.91 · $0.08 · 0.6s · 2 tool calls
  • content-writer · warn · 0.62 · $0.21 · 3.4s · 4 tool calls

Works with any MCP-compatible agent

Claude Desktop · Cursor · Claude Code · Windsurf · LangChain · CrewAI · MCP SDK · AutoGen

The Problem

Your agents pass every health check.

Infrastructure monitoring tells you the request succeeded. It cannot tell you the answer was wrong. Your agents need a quality gate — something that scores every output for safety, accuracy, and cost before it reaches a user.

What your APM sees
  • Status: 200 OK
  • Latency: 143ms
  • Memory: 245 MB
  • CPU: 12%
  • Throughput: 847 req/min
  • Health: All systems operational

What Iris sees
  • PII Detected: SSN pattern in output (***-**-6789)
  • Injection Risk: prompt manipulation attempt detected
  • Cost: $0.47 / query (4.7x over $0.10 threshold)
  • Hallucination Markers: "As an AI language model" in output
  • Tool call #3 error: database_lookup timed out (30s)
  • Quality Score: 0.32 / 1.0 — FAIL

Product

Nine tools. One quality standard.

Iris registers as an MCP server. Your agent discovers it and invokes its tools automatically. No SDK. No code changes.

Every execution. Every tool call. Every token.

log_trace captures full agent runs with hierarchical spans, per-tool-call latency, token usage, and cost in USD.

  • Hierarchical span tree with OpenTelemetry-compatible span kinds
  • Per-tool-call latency tracking
  • Token usage breakdown (prompt, completion, total)
  • Arbitrary metadata for custom attribution
Span Tree
AGENT research-agent (2.3s)
├─ LLM  system_prompt (0.1s)
├─ TOOL web_search (0.8s)
├─ LLM  summarize_results (0.4s)
├─ TOOL database_query (0.3s)
└─ LLM  final_response (0.7s)
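A trace like the one above could be represented as a nested payload of spans with per-call latency, token usage, and cost. The sketch below is illustrative only — the field names and shapes are assumptions, not the documented `@iris-eval/mcp-server` schema for `log_trace`:

```typescript
// Hypothetical shape for a log_trace payload — field names are
// illustrative, not the actual @iris-eval/mcp-server schema.
interface Span {
  name: string;
  kind: "AGENT" | "LLM" | "TOOL"; // OpenTelemetry-style span kinds
  durationMs: number;
  children?: Span[];
}

const trace = {
  agent: "research-agent",
  spans: <Span>{
    name: "research-agent",
    kind: "AGENT",
    durationMs: 2300,
    children: [
      { name: "system_prompt", kind: "LLM", durationMs: 100 },
      { name: "web_search", kind: "TOOL", durationMs: 800 },
      { name: "summarize_results", kind: "LLM", durationMs: 400 },
      { name: "database_query", kind: "TOOL", durationMs: 300 },
      { name: "final_response", kind: "LLM", durationMs: 700 },
    ],
  },
  tokens: { prompt: 1850, completion: 420, total: 2270 }, // made-up counts
  costUsd: 0.12,
  metadata: { run_id: "abc-123" }, // arbitrary custom attribution
};

// Sanity check: child span time should not exceed the root span's duration.
const childMs = trace.spans.children!.reduce((sum, c) => sum + c.durationMs, 0);
console.log(childMs <= trace.spans.durationMs); // true
```

The hierarchical `children` array is what produces the span tree shown in the dashboard; the root AGENT span wraps the interleaved LLM and TOOL calls.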

Built for

Three problems. One MCP server.

Every team building AI agents hits the same walls. Iris was built to tear them down — without touching your code.

Developers shipping MCP agents

You deployed an agent and you have no idea what it's doing.

Iris traces every execution, tool call, and token automatically. No SDK. No code changes. Add it to your MCP config and start seeing everything.

60s to first trace
Teams monitoring agent costs

Your agent burned $0.47 on a single query and your APM showed 200 OK.

Iris tracks cost per trace, per agent, per time window. Set budget thresholds and get flagged when agents overspend — before finance finds out.

$0.07 avg cost visibility per trace
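Per-trace cost tracking of this kind reduces to multiplying token counts by per-token prices and comparing the result to a budget. A minimal sketch — the prices and threshold here are made-up illustrations, not Iris's internal values:

```typescript
// Minimal sketch of per-trace cost tracking with a budget threshold.
// Prices and threshold are illustrative assumptions, not Iris's internals.
const PRICE_PER_1K = { prompt: 0.003, completion: 0.015 }; // USD per 1K tokens
const BUDGET_USD = 0.10; // flag any trace that costs more than this

function traceCostUsd(promptTokens: number, completionTokens: number): number {
  return (
    (promptTokens / 1000) * PRICE_PER_1K.prompt +
    (completionTokens / 1000) * PRICE_PER_1K.completion
  );
}

// An unusually heavy trace: 100K prompt tokens, 12K completion tokens.
const cost = traceCostUsd(100_000, 12_000);
console.log(cost.toFixed(2));   // "0.48"
console.log(cost > BUDGET_USD); // true — this trace would be flagged
```

Aggregating the same per-trace figure by agent or by time window is what turns a single $0.47 surprise into a trend you can alert on.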
Companies preventing PII leaks

Your agent leaked a Social Security number in its output and nobody noticed for 3 months.

Iris evaluates every output against 13 built-in rules, including PII detection across 10 patterns (SSN, credit card, phone, email, IBAN, DOB, medical record number, IP address, API key, passport), prompt injection detection (13 patterns), stub-output detection, and hallucination markers. In real time, on every trace.

13 built-in eval rules
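Rule-based PII detection typically boils down to a set of regular expressions run over the model's output. A simplified sketch covering three of the pattern families named above — these regexes are illustrative, not Iris's production rules:

```typescript
// Simplified PII scan — illustrative regexes, not Iris's production rules.
const PII_PATTERNS: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,          // US Social Security number shape
  email: /\b[\w.+-]+@[\w-]+\.[\w.-]+\b/, // email address
  creditCard: /\b(?:\d[ -]?){13,16}\b/,  // loose 13-16 digit card shape
};

// Return the names of every PII pattern that matches the output.
function detectPii(output: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, re]) => re.test(output))
    .map(([name]) => name);
}

console.log(detectPii("Customer SSN is 123-45-6789, reach me at a@b.co"));
// detects "ssn" and "email"
```

Because the checks are plain regex scans, they are deterministic and fast enough to run on every trace, which is why heuristic rules like these can gate output in real time.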

Join the community

9 MCP tools: log, evaluate, query, deploy/delete rules, delete traces, LLM judge (BYOK), citation verify (BYOK)
13 built-in eval rules: completeness, relevance, safety, cost
Eval latency: heuristic rules, fast and deterministic
0 lines of code to integrate: add to MCP config and you're done

Open Source — Free Forever to Self-Host

60 seconds to first trace.

Install Iris locally and start seeing what your agents are doing. Works with Claude Desktop, Cursor, Windsurf, or any MCP-compatible agent. Free, MIT-licensed, your data stays on your machine.

claude_desktop_config.json
{
  "mcpServers": {
    "iris-eval": {
      "command": "npx",
      "args": ["@iris-eval/mcp-server"]
    }
  }
}
Terminal
$ npm install -g @iris-eval/mcp-server
$ iris-mcp --dashboard
✓ Dashboard running at http://localhost:3838
Cursor
Install Iris in Cursor

One-click install for Cursor IDE.
No config file needed.

Pricing

Free to self-host. Cloud when you're ready.

The open-source core is MIT licensed with no limits. The cloud adds team dashboards, alerting, and managed infrastructure — starting free.

Open Source

Self-Hosted

$0 forever

Everything you need to evaluate your MCP agents in production. Your machine, your data, your eval rules.

  • 9 MCP tools — full lifecycle + LLM judge + semantic citation verify (SSRF-guarded)
  • LLM-as-judge + citation verify use your own Anthropic/OpenAI API key (BYOK, no proxy)
  • 13 built-in eval rules + custom rules
  • Web dashboard with trace visualization
  • SQLite storage — zero infrastructure
  • Production security (auth, rate limiting)
  • Cost tracking per trace
  • Docker + npm + npx install
  • Community support (GitHub + Discord)
Free · Coming Soon

Cloud Starter

$0/month

Run evaluations in the cloud with no commitment. Same eval engine, managed for you. No credit card.

  • Everything in Self-Hosted, plus:
  • 10,000 evaluations / month
  • 7-day eval history
  • 1 team member
  • Managed PostgreSQL
  • Personal dashboard
  • No credit card required
Most Popular · Coming Soon

Cloud Pro

$49/month

For teams that need shared eval results, alerting on quality regressions, and room to scale.

  • Everything in Starter, plus:
  • 25,000 evaluations included
  • $0.005 per additional evaluation
  • 90-day eval history
  • Unlimited team members
  • Team dashboards with shared views
  • Alerting (webhook + email)
  • API key management
  • CSV / JSON data export
  • Priority support
Custom · Coming Soon

Enterprise

Custom

For organizations that need audit-grade evaluation records, compliance, and dedicated support.

  • Everything in Pro, plus:
  • SSO / SAML (Okta, Azure AD, Google)
  • RBAC with custom roles
  • Audit logs with export
  • SOC 2 Type II documentation
  • Custom retention policies
  • SLA with uptime guarantee
  • Dedicated support + onboarding
  • EU AI Act compliance support

All plans include unlimited eval rules, both transports (stdio + HTTP), and full API access.
Waitlist members get founding-member pricing and a direct line to shape the roadmap.

Get early access to Iris Cloud

No spam. One email when the cloud tier launches.

I kept running into the same problem building AI agents: once they're running, you have no visibility into what they're actually doing. Traditional monitoring tells you the request succeeded. It can't tell you the agent leaked PII, hallucinated an answer, or burned through your budget on a single query.

So I built Iris — an MCP server that any agent discovers and uses automatically. No SDK. No code changes. Just add it to your config and start seeing everything.

Ian Parent
Founder & Builder

Roadmap

Built in public. Shipping fast.

v0.1 · Released

Core MCP Server

3 tools, initial 12-rule library, SQLite storage, web dashboard, production security

v0.2 · Released

Eval Sensitivity + Security Hardening

Smart rule exclusion, configurable thresholds, SQL whitelist, CSP headers, accessibility

v0.3 · Released

Dashboard Phase-1 + Pricing

OKLCH palette, dark/light theme, trace-ID copy, eval sparkline, pricing page, MCP-native validation harness

v0.3.1 · Released

Rule Library Expansion

13 eval rules (added no_stub_output), 10 PII patterns (IBAN, DOB, MRN, IP, API key, passport), 13 injection patterns, fabricated-citation heuristic, 55-case CI regression gate

v0.4 · Planned

LLM-as-Judge + Citation Verify + OTel + 9-tool MCP Surface

  • 9 MCP tools covering the full rule + trace lifecycle (list_rules, deploy_rule, delete_rule, delete_trace, evaluate_with_llm_judge, verify_citations added)
  • LLM-as-judge eval (Claude/GPT-4o, cost-capped, 5 prompt templates)
  • Semantic citation verification: 4 citation kinds (numbered, author-year, URL, DOI), SSRF-guarded fetch, per-claim LLM verdict
  • OpenTelemetry export, tenant-id scaffolding
  • SBOM + cosign signing, Playwright E2E, Lighthouse CI, v2.C chrome polish

v0.5 · Planned

Cloud Tier

Managed Iris — PostgreSQL adapter, full multi-tenancy with user accounts + workspace isolation, team eval dashboards, usage-based billing

v0.6 · Planned

Alerting & Retention

Alert rules, webhooks, email notifications, retention policies, drift detection

v0.7 · Planned

Enterprise

SSO/SAML, RBAC, audit logs export, SOC 2 compliance