Iris v0.1 — The agent eval standard for MCP. 12 eval rules, open source

Output Quality Score

One number that tells you if your agent's output is good enough.

Definition#

Output Quality Score (OQS) is a composite metric that rolls completeness, relevance, safety, and cost into a single number between 0 and 1 for every agent output. Instead of checking four dimensions separately, teams get one signal: is this output good enough?
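As a rough illustration of the idea (not Iris's actual formula — the weights below are arbitrary assumptions), a composite can be sketched as a weighted average of the four dimension scores:

```python
# Illustrative sketch only: these dimension weights are assumptions,
# not Iris's real scoring formula.
WEIGHTS = {"completeness": 0.3, "relevance": 0.3, "safety": 0.25, "cost": 0.15}

def output_quality_score(scores: dict[str, float]) -> float:
    """Roll four 0-1 dimension scores into one 0-1 composite."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# A detailed answer to the wrong question still scores poorly overall:
oqs = output_quality_score(
    {"completeness": 0.9, "relevance": 0.3, "safety": 1.0, "cost": 0.8}
)
```

Because the weights sum to 1, the composite stays in the same 0–1 range as the individual dimensions.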

Four Dimensions#

Completeness

Did the agent answer the full question? Is the response structurally complete? Minimum length, required sections, response format.

Relevance

Is the response on-topic? Does it address the actual input? Topic consistency, keyword presence, semantic alignment.

Safety

Is the output safe to show to users? PII detection, prompt injection patterns, hallucination markers, blocklist enforcement.

Cost

Is the output cost-efficient? Token usage relative to output quality, USD per trace, cost threshold enforcement.
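Each dimension reduces to concrete, checkable rules. A minimal sketch of what such per-dimension scorers might look like — the heuristics, thresholds, and regex here are illustrative assumptions, not Iris's rule set:

```python
import re

# Toy per-dimension scorers; thresholds and patterns are assumptions.

def completeness(text: str, min_words: int = 30) -> float:
    """Structural check: scale score by length up to a minimum word count."""
    return min(len(text.split()) / min_words, 1.0)

def relevance(text: str, expected_keywords: list[str]) -> float:
    """Keyword presence as a cheap proxy for topic consistency."""
    if not expected_keywords:
        return 1.0
    hits = sum(k.lower() in text.lower() for k in expected_keywords)
    return hits / len(expected_keywords)

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.\w{2,}\b")  # one example PII pattern

def safety(text: str) -> float:
    """Binary: any PII match fails the dimension outright."""
    return 0.0 if EMAIL.search(text) else 1.0

def cost(tokens_used: int, token_budget: int = 500) -> float:
    """Cost efficiency: full score under budget, decaying past it."""
    return min(token_budget / tokens_used, 1.0) if tokens_used else 1.0
```

Note the asymmetry: completeness, relevance, and cost degrade gradually, while safety is pass/fail — which is what makes the override behavior below possible.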

Why One Number Matters#

Individual eval rules tell you what's wrong. The OQS tells you whether you should care. A response might score 0.9 on completeness but 0.3 on relevance — the OQS captures that it's a detailed answer to the wrong question. It's the signal you monitor on a dashboard, set alerts on, and report to stakeholders.

Key Data

Safety scores override the composite. A response with perfect completeness, relevance, and cost scores 0 on OQS if it contains PII. You can't average away a safety violation.
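The gating behavior can be sketched like so (a simplified illustration of the rule described above, with assumed equal weights, not Iris's implementation):

```python
def gated_oqs(completeness: float, relevance: float, cost: float,
              safety_violation: bool) -> float:
    """Safety gates the composite: any violation zeroes the score."""
    if safety_violation:
        return 0.0  # a PII hit can't be averaged away
    # Equal weights are an assumption for illustration.
    return (completeness + relevance + cost) / 3

gated_oqs(1.0, 1.0, 1.0, safety_violation=True)  # → 0.0 despite perfect dimensions
```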

How Iris Helps#

Iris scores every output across all four dimensions. The dashboard shows individual rule results and aggregate quality trends. The composite signal makes it easy to spot when overall quality is declining — even when individual dimensions look acceptable in isolation.

Read the deep dive: Output Quality Score →

Frequently Asked Questions#