Blog — Iris

April 14, 2026·Ian Parent

Closing the Eval Gap: From Lenient Defaults to Signal That Matters

Default eval thresholds are designed to catch catastrophe, not degradation. Here's how configurable thresholds and smarter rule exclusion turn your evals from rubber stamps into real quality gates.

eval-gapagent-evalthresholdsquality

April 3, 2026·Ian Parent

Output Quality Score: The Single Number That Tells You If Your Agent Is Good Enough

Output Quality Score (OQS) is a composite metric that rolls completeness, relevance, safety, and cost into one number — giving teams a single quality signal for every agent output.

output-quality-scoreoqsagent-evalquality

March 31, 2026·Ian Parent

Self-Calibrating Eval: The End of Manual Threshold Tuning

Static eval thresholds break over time. Self-calibrating eval is the pattern where the system monitors its own scoring distribution and recommends adjustments — always human-approved.

self-calibrating-evaleval-advisoreval-driftagent-eval

March 29, 2026·Ian Parent

The Eval Loop: Why Evals Are the Loss Function for Agent Quality

Most teams treat eval as a one-time gate. The real pattern is a continuous loop: score, diagnose, calibrate, re-score. This is the eval loop — and it changes how you build agents.

eval-loopagent-evalqualitycalibration

March 29, 2026·Ian Parent

Why On-Chain Agent Actions Need Pre-Flight Eval

On-chain actions are irreversible. With 250K daily AI agents on blockchain and $3.4B stolen in 2025, real-time pre-execution eval isn't optional — it's the missing safety layer between agent decision and permanent consequence.

cryptodefiblockchainagent-eval

March 28, 2026·Ian Parent

Eval-Driven Development: Write the Rules Before the Prompt

Eval-Driven Development applies TDD principles to AI agents: define eval rules before prompts, iterate on scores, ship when rules pass.

eddeval-driven-developmentagent-evaltdd

March 26, 2026·Ian Parent

Eval Coverage: The Metric Your AI Agents Are Missing

Eval coverage measures the percentage of agent executions that receive evaluation. Most teams are at 0%. Here's why 100% is the only target.

eval-coverageagent-evaltestingmcp

March 24, 2026·Ian Parent

The Eval Gap: Why Your AI Demo Works and Production Doesn't

The eval gap is why your AI demo works but production fails. Learn the four mechanisms that create it and how inline evaluation closes it.

eval-gapagent-evalproductionmcp

March 22, 2026·Ian Parent

Eval Drift: The Silent Quality Killer for AI Agents

Eval drift is the silent degradation of agent quality caused by upstream model changes you can't control. Learn how to detect and prevent it.

eval-driftagent-evalqualitymcp

March 21, 2026·Ian Parent

The AI Eval Tax: The Hidden Cost Every Agent Team Is Paying

The eval tax is the compounding cost of every unscored agent output — in trust, engineering hours, and liability. Here's how to stop paying.

eval-taxagent-evalproductioncost

March 17, 2026·Ian Parent

MCP Meets OpenTelemetry: Bridging Agent Observability and Infrastructure Monitoring

How Iris bridges agent observability and infrastructure monitoring by exporting MCP traces as OpenTelemetry spans to Datadog and Grafana.

opentelemetryobservabilitymcpinfrastructure

March 17, 2026·Ian Parent

Agent Errors vs Application Errors: Why Your Error Tracker Can't See AI Failures

Why Sentry and Bugsnag can't detect hallucinations, PII leaks, or prompt injection — and what agent-level error tracking looks like.

observabilityagentserror-trackingeval

March 16, 2026·Ian Parent

MCP Observability is the New APM

MCP observability is following the same adoption curve as APM — and teams without agent-native monitoring will face the same reckoning.

observabilityapmmcpagents

March 16, 2026·Ian Parent

How to Evaluate AI Agent Output Without Calling Another LLM

A step-by-step tutorial for evaluating AI agent output using deterministic heuristic rules — no LLM-as-Judge, no added cost, sub-millisecond.

evalagentsmcptutorial

March 16, 2026·Ian Parent

Toward an MCP Observability Specification

A proposal for standardizing MCP observability with trace schemas, eval interfaces, and cost metadata to prevent ecosystem fragmentation.

mcpobservabilityspecificationprotocol

March 15, 2026·Ian Parent

Why Every MCP Agent Needs an Independent Observer

Why self-reported agent logs are structurally untrustworthy and how MCP enables architecturally independent observability for AI agents.

observabilityagentsmcparchitecture

March 15, 2026·Ian Parent

The Cost of Invisible Agents: What $0.47 Per Query Looks Like at Scale

How invisible token costs compound to $14,000 monthly bills when agents lack per-trace cost tracking and budget threshold enforcement.

costobservabilityagentsmcp

March 15, 2026·Ian Parent

Heuristic vs Semantic Eval: When <1ms Matters More Than LLM-as-Judge

When sub-millisecond heuristic eval rules outperform LLM-as-Judge for PII detection, prompt injection, and cost threshold enforcement.

evaluationheuristicllm-as-judgeperformance

March 14, 2026·Ian Parent

The State of MCP Agent Observability (March 2026)

A comprehensive analysis of the MCP agent observability landscape in 2026, covering market trends, security gaps, and eval approaches.

observabilitymcpagentsreport

March 13, 2026·Ian Parent

Why Your AI Agents Need Observability

Learn why traditional APM fails for AI agents and how MCP-native observability with Iris provides the tracing and evaluation agents need.

observabilityagentsmcpevaluation

Research & insights

Closing the Eval Gap: From Lenient Defaults to Signal That Matters

Output Quality Score: The Single Number That Tells You If Your Agent Is Good Enough

Self-Calibrating Eval: The End of Manual Threshold Tuning

The Eval Loop: Why Evals Are the Loss Function for Agent Quality

Why On-Chain Agent Actions Need Pre-Flight Eval

Eval-Driven Development: Write the Rules Before the Prompt

Eval Coverage: The Metric Your AI Agents Are Missing

The Eval Gap: Why Your AI Demo Works and Production Doesn't

Eval Drift: The Silent Quality Killer for AI Agents

The AI Eval Tax: The Hidden Cost Every Agent Team Is Paying

MCP Meets OpenTelemetry: Bridging Agent Observability and Infrastructure Monitoring

Agent Errors vs Application Errors: Why Your Error Tracker Can't See AI Failures

MCP Observability is the New APM

How to Evaluate AI Agent Output Without Calling Another LLM

Toward an MCP Observability Specification

Why Every MCP Agent Needs an Independent Observer

The Cost of Invisible Agents: What $0.47 Per Query Looks Like at Scale

Heuristic vs Semantic Eval: When <1ms Matters More Than LLM-as-Judge

The State of MCP Agent Observability (March 2026)

Why Your AI Agents Need Observability