What is the difference between Iris and Braintrust?

Iris is an MCP-native agent eval tool that requires zero code changes — your agent discovers it automatically via MCP config. Braintrust is a comprehensive SDK-based eval platform with datasets, experiments, prompt playground, and CI-integrated regression testing. Iris focuses on zero-code simplicity with heuristic eval rules, while Braintrust offers deeper evaluation workflows with LLM scoring and human review.

Is Iris better than Braintrust for MCP agent evaluation?

For MCP-compatible agents, Iris provides protocol-native integration with zero SDK overhead and free open-source licensing. Braintrust requires SDK imports but offers powerful experimentation workflows, dataset management, and multi-language support. The best choice depends on whether you need zero-code MCP simplicity or a full SDK-based eval and experimentation platform.

Comparison

Iris vs Braintrust

MCP-native, zero-code observability vs SDK-powered eval and experimentation platform. Two different philosophies for AI quality.

TL;DR

Iris is an MCP server that your agent discovers and uses automatically: zero code changes, zero SDK imports, and one SQLite file for storage. Eval runs locally with sub-millisecond heuristic rules. Braintrust is a comprehensive eval and observability platform with powerful dataset management, experiment tracking, a prompt playground, and deep tracing. If you're building with MCP-compatible agents and want the simplest possible setup with local eval, Iris gets you there in 60 seconds. If you need production-grade experimentation workflows, human review, or CI-integrated regression testing, Braintrust is the deeper eval platform.
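
To make "zero code changes" concrete, here is a minimal sketch of what registering an MCP server usually involves: one entry in the client's config file, with nothing added to the agent's own code. It assumes the `mcpServers` layout that clients such as Claude Desktop read. The server name, launch command, and package name are placeholders invented for illustration, not Iris's documented install command.

```python
import json


def build_mcp_config(server_name: str, command: str, args: list[str]) -> dict:
    """Build a config dict in the `mcpServers` shape read by several MCP clients."""
    return {"mcpServers": {server_name: {"command": command, "args": args}}}


# Placeholder values: the real launch command and package name for Iris
# are not stated on this page, so check the project's docs before using.
config = build_mcp_config("iris", "npx", ["-y", "example-iris-mcp-server"])

# Paste the printed JSON into your MCP client's config file.
print(json.dumps(config, indent=2))
```

The agent side stays untouched: the client launches the server from this entry and exposes its tools over MCP.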

For background on agent evaluation methodology, see our agent eval guide.

Feature Comparison

Side by side.

Feature	Iris	Braintrust
Integration method	MCP config (zero code)	SDK imports (Python, TS, Go, Ruby, C#)
Self-hosting	Single SQLite file	Enterprise plan only (cloud-first)
Performance overhead	Zero (no SDK in hot path)	Async logging, minimal overhead
Eval approach	12 built-in + 8 custom heuristic rules (<1ms)	LLM, code, and human scoring + datasets + experiments
Prompt playground	Not included	Full playground with side-by-side comparison
Datasets & experiments	Not included	Production traces to datasets, experiment tracking, CI integration
Cost tracking	Per-trace USD cost	Per-trace cost, per-user/feature/model breakdowns
MCP support	Protocol-native (IS an MCP server)	MCP server for querying Braintrust data
License	MIT (fully permissive)	Proprietary (proxy is MIT)
Pricing	Free & open-source	Free tier (1M spans) / Pro $249/mo / Enterprise custom
Tracing depth	MCP tool calls and agent traces	Full trace trees with token-level detail, visual timeline
Enterprise features	Roadmap (v0.5)	SOC 2, SSO, hybrid deployment, dedicated support

Decision Guide

Which one fits your stack?

When to choose Iris

You're building with MCP-compatible agents (Claude Desktop, Cursor, Windsurf)
You want zero-code integration — no SDK imports, no wrapper functions
You want simple self-hosting — one binary, one SQLite file, no cloud dependency
You want fully permissive MIT licensing with no proprietary modules
You want sub-millisecond heuristic eval that runs locally without LLM calls
You want to avoid per-seat or usage-based pricing
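
If "heuristic eval" is unfamiliar, the sketch below shows the general idea: deterministic checks on an agent's output that need no model call, which is why they can run locally and quickly. The three rules and their names are hypothetical examples written for this page, not Iris's built-in rule set or its API.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Simple pattern for spotting email addresses; illustrative, not exhaustive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class RuleResult:
    name: str
    passed: bool


def non_empty(output: str) -> bool:
    """Pass when the output contains something other than whitespace."""
    return bool(output.strip())


def no_email_address(output: str) -> bool:
    """Pass when no email-like string appears in the output."""
    return EMAIL_RE.search(output) is None


def within_length(output: str, max_chars: int = 2000) -> bool:
    """Pass when the output stays under a character budget."""
    return len(output) <= max_chars


# Hypothetical rule registry: name -> check function.
RULES: dict[str, Callable[[str], bool]] = {
    "non_empty": non_empty,
    "no_email_address": no_email_address,
    "within_length": within_length,
}


def evaluate(output: str) -> list[RuleResult]:
    """Run every rule against one agent output and collect the results."""
    return [RuleResult(name, rule(output)) for name, rule in RULES.items()]


results = evaluate("Contact me at a@b.com")
failed = [r.name for r in results if not r.passed]
print(failed)
```

Checks like these trade nuance for speed and predictability. An LLM or human scorer can judge tone or factual accuracy, which a regex cannot.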

When to choose Braintrust

You need deep eval capabilities — datasets, experiments, human review, LLM scoring
You need a prompt playground for iterating on prompts with real data
You need enterprise compliance today (SOC 2, SSO, hybrid deployment)
You need multi-language SDK support (Python, TypeScript, Go, Ruby, C#)
You need granular cost analytics sliced by user, feature, or model
You need CI/CD-integrated regression testing against production datasets
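
The difference between per-trace cost and cost sliced by user, feature, or model comes down to aggregation. The sketch below computes a USD cost per trace from token counts, then groups the totals by any field. The model name, token prices, and trace fields are made up for illustration and do not reflect either product's schema or any vendor's real pricing.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens.
PRICES = {"model-a": {"input": 1.00, "output": 2.00}}


def trace_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single trace, from its token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_by(traces: list[dict], key: str) -> dict[str, float]:
    """Sum per-trace cost grouped by a field such as 'user' or 'feature'."""
    totals: dict[str, float] = defaultdict(float)
    for t in traces:
        totals[t[key]] += trace_cost(t["model"], t["input_tokens"], t["output_tokens"])
    return dict(totals)


# Made-up traces with assumed field names.
traces = [
    {"user": "alice", "feature": "search", "model": "model-a",
     "input_tokens": 1_000_000, "output_tokens": 500_000},
    {"user": "bob", "feature": "search", "model": "model-a",
     "input_tokens": 500_000, "output_tokens": 0},
]

print(cost_by(traces, "user"))
print(cost_by(traces, "feature"))
```

With per-trace cost alone you would run this kind of grouping yourself. A platform with built-in breakdowns does it for you.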

Last verified: March 2026. This comparison is based on publicly available documentation and may not reflect recent changes to Braintrust. We aim to keep this page accurate and fair.

See something outdated or incorrect? Report an inaccuracy — we review and update within 48 hours.
