
Blog

Research & insights

Original research on MCP agent observability, evaluation methodology, and the evolving landscape of AI agent infrastructure.

Ian Parent

Closing the Eval Gap: From Lenient Defaults to Signal That Matters

Default eval thresholds are designed to catch catastrophe, not degradation. Here's how configurable thresholds and smarter rule exclusion turn your evals from rubber stamps into real quality gates.

eval-gap · agent-eval · thresholds · quality
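As a taste of the post's argument, here is the shape of the fix in miniature, sketched as Python config. The keys and values are illustrative assumptions, not Iris's actual settings.

# Hypothetical eval config: raise the lenient default and exclude noisy
# rules explicitly, instead of lowering the bar for everything.
EVAL_CONFIG = {
    "min_score": 0.75,                  # a catastrophe-only default might sit near 0.4
    "excluded_rules": ["tone-check"],   # opt a noisy rule out, with a paper trail
}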
Ian Parent

Output Quality Score: The Single Number That Tells You If Your Agent Is Good Enough

Output Quality Score (OQS) is a composite metric that rolls completeness, relevance, safety, and cost into one number — giving teams a single quality signal for every agent output.

output-quality-score · oqs · agent-eval · quality
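A minimal sketch of the composite idea in Python. The dimension names come from the teaser above, but the weights are assumptions, not Iris's defaults.

# Hypothetical weights; the post defines the actual composition.
OQS_WEIGHTS = {"completeness": 0.35, "relevance": 0.35, "safety": 0.20, "cost": 0.10}

def output_quality_score(subscores: dict) -> float:
    # Each sub-score is assumed to be normalized to [0, 1].
    return sum(w * subscores[name] for name, w in OQS_WEIGHTS.items())

# output_quality_score({"completeness": 0.9, "relevance": 0.8,
#                       "safety": 1.0, "cost": 0.7}) -> 0.865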
Ian Parent

Self-Calibrating Eval: The End of Manual Threshold Tuning

Static eval thresholds break over time. Self-calibrating eval is the pattern where the system monitors its own scoring distribution and recommends adjustments — always human-approved.

self-calibrating-eval · eval-advisor · eval-drift · agent-eval
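The pattern, reduced to a sketch: watch the rolling score distribution and surface a suggestion rather than silently changing anything. Every window size and cutoff below is an assumption.

from statistics import mean, stdev

def recommend_threshold(recent_scores: list, current: float):
    """Return a suggested threshold, or None; a human approves any change."""
    if len(recent_scores) < 50:              # too little data to calibrate
        return None
    mu, sigma = mean(recent_scores), stdev(recent_scores)
    suggested = round(mu - 2 * sigma, 2)     # flag the distribution's bottom tail
    return suggested if abs(suggested - current) > 0.05 else None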
Ian Parent

The Eval Loop: Why Evals Are the Loss Function for Agent Quality

Most teams treat eval as a one-time gate. The real pattern is a continuous loop: score, diagnose, calibrate, re-score. This is the eval loop — and it changes how you build agents.

eval-loop · agent-eval · quality · calibration
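The loop itself, as a hedged sketch; agent, rules, and calibrate are stand-ins for whatever your eval stack actually provides.

def eval_loop(agent, cases, rules, calibrate):
    while True:
        results = [rules.score(agent.run(case)) for case in cases]  # score
        failures = [r for r in results if not r.passed]             # diagnose
        if not failures:
            return rules                                            # ship
        rules = calibrate(rules, failures)                          # calibrate, then re-score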
Ian Parent

Why On-Chain Agent Actions Need Pre-Flight Eval

On-chain actions are irreversible. With 250K AI agents operating on-chain daily and $3.4B stolen in 2025, real-time pre-execution eval isn't optional: it's the missing safety layer between agent decision and permanent consequence.

crypto · defi · blockchain · agent-eval
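Structurally, the safety layer the post describes is a gate between decision and signature. A sketch, with evaluate_action and sign_and_send as hypothetical stand-ins for an eval layer and a chain client:

def execute_onchain(tx, evaluate_action, sign_and_send):
    verdict = evaluate_action(tx)    # runs before anything irreversible
    if not verdict["passed"]:
        raise RuntimeError(f"pre-flight eval blocked tx: {verdict['reason']}")
    return sign_and_send(tx)         # only reached once the gate clears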
Ian Parent

Eval-Driven Development: Write the Rules Before the Prompt

Eval-Driven Development applies TDD principles to AI agents: define eval rules before prompts, iterate on scores, ship when rules pass.

edd · eval-driven-development · agent-eval · tdd
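EDD in miniature: the rules exist before any prompt does. The two rules below are placeholders for illustration, not a recommended suite.

RULES = [
    lambda out: "as an ai" not in out.lower(),   # no boilerplate hedging
    lambda out: len(out.split()) >= 20,          # non-trivial answer
]

def passes(output: str) -> bool:
    return all(rule(output) for rule in RULES)

# Only now do you write the prompt; iterate until passes() holds on your cases.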
Ian Parent

Eval Coverage: The Metric Your AI Agents Are Missing

Eval coverage measures the percentage of agent executions that receive evaluation. Most teams are at 0%. Here's why 100% is the only target.

eval-coverage · agent-eval · testing · mcp
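The metric is deliberately simple, which is the point. As a sketch (the function name is assumed):

def eval_coverage(evaluated_runs: int, total_runs: int) -> float:
    """Share of agent executions that received an eval score."""
    return evaluated_runs / total_runs if total_runs else 0.0

# Most teams today: eval_coverage(0, 10_000) == 0.0. The post argues for 1.0.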
Ian Parent

The Eval Gap: Why Your AI Demo Works and Production Doesn't

The eval gap is why your AI demo works but production fails. Learn the four mechanisms that create it and how inline evaluation closes it.

eval-gap · agent-eval · production · mcp
Ian Parent

Eval Drift: The Silent Quality Killer for AI Agents

Eval drift is the silent degradation of agent quality caused by upstream model changes you can't control. Learn how to detect and prevent it.

eval-drift · agent-eval · quality · mcp
Ian Parent

The AI Eval Tax: The Hidden Cost Every Agent Team Is Paying

The eval tax is the compounding cost of every unscored agent output — in trust, engineering hours, and liability. Here's how to stop paying.

eval-tax · agent-eval · production · cost
Ian Parent

MCP Meets OpenTelemetry: Bridging Agent Observability and Infrastructure Monitoring

How Iris bridges agent observability and infrastructure monitoring by exporting MCP traces as OpenTelemetry spans to Datadog and Grafana.

opentelemetry · observability · mcp · infrastructure
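For a sense of what the bridge looks like on the wire, a minimal sketch using the OpenTelemetry Python SDK. The span and attribute names are illustrative, not Iris's actual schema.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))  # OTLP endpoint -> Datadog/Grafana
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("mcp-agent")
with tracer.start_as_current_span("mcp.tool_call") as span:
    span.set_attribute("mcp.tool", "search")              # hypothetical attribute keys
    span.set_attribute("gen_ai.usage.input_tokens", 812)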
Ian Parent

Agent Errors vs Application Errors: Why Your Error Tracker Can't See AI Failures

Why Sentry and Bugsnag can't detect hallucinations, PII leaks, or prompt injection — and what agent-level error tracking looks like.

observability · agents · error-tracking · eval
Ian Parent

MCP Observability is the New APM

MCP observability is following the same adoption curve as APM — and teams without agent-native monitoring will face the same reckoning.

observability · apm · mcp · agents
Ian Parent

How to Evaluate AI Agent Output Without Calling Another LLM

A step-by-step tutorial for evaluating AI agent output using deterministic heuristic rules — no LLM-as-Judge, no added cost, sub-millisecond.

eval · agents · mcp · tutorial
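In the spirit of the tutorial, one deterministic rule end to end; the patterns are deliberately minimal and would need hardening in practice.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def pii_clean(output: str) -> bool:
    """True when no PII pattern matches; pure regex, no LLM call, sub-millisecond."""
    return not (EMAIL.search(output) or US_SSN.search(output))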
Ian Parent

Toward an MCP Observability Specification

A proposal for standardizing MCP observability with trace schemas, eval interfaces, and cost metadata to prevent ecosystem fragmentation.

mcp · observability · specification · protocol
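What a standardized trace record might carry, per the proposal's three pillars; every field name below is speculative, not the spec.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MCPTrace:
    trace_id: str                 # trace schema
    tool: str
    latency_ms: float
    eval_score: Optional[float]   # eval interface: any conforming scorer fills this
    cost_usd: float               # cost metadata travels with the trace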
Ian Parent

Why Every MCP Agent Needs an Independent Observer

Why self-reported agent logs are structurally untrustworthy and how MCP enables architecturally independent observability for AI agents.

observability · agents · mcp · architecture
Ian Parent

The Cost of Invisible Agents: What $0.47 Per Query Looks Like at Scale

How invisible token costs compound to $14,000 monthly bills when agents lack per-trace cost tracking and budget threshold enforcement.

cost · observability · agents · mcp
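The headline numbers, worked through; the daily volume is an assumption chosen to reproduce the post's figure.

cost_per_query = 0.47
queries_per_day = 1_000
monthly_bill = cost_per_query * queries_per_day * 30
print(monthly_bill)   # 14100.0, roughly the $14,000/month the post cites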
Ian Parent

Heuristic vs Semantic Eval: When <1ms Matters More Than LLM-as-Judge

When sub-millisecond heuristic eval rules outperform LLM-as-Judge for PII detection, prompt injection, and cost threshold enforcement.

evaluation · heuristic · llm-as-judge · performance
Ian Parent

The State of MCP Agent Observability (March 2026)

A comprehensive analysis of the MCP agent observability landscape in 2026, covering market trends, security gaps, and eval approaches.

observability · mcp · agents · report
Ian Parent

Why Your AI Agents Need Observability

Learn why traditional APM fails for AI agents and how MCP-native observability with Iris provides the tracing and evaluation agents need.

observability · agents · mcp · evaluation