Output Quality Score
One number that tells you if your agent's output is good enough.
Definition
The Output Quality Score (OQS) is a single composite number that aggregates an agent output's scores across four eval dimensions (completeness, relevance, safety, and cost) into one measure of whether the response is good enough to ship.
Four Dimensions
Completeness
Did the agent answer the full question, and is the response structurally complete? Checks cover minimum length, required sections, and response format.
Relevance
Is the response on-topic, and does it address the actual input? Checks cover topic consistency, keyword presence, and semantic alignment.
Safety
Is the output safe to show to users? Checks cover PII detection, prompt injection patterns, hallucination markers, and blocklist enforcement.
Cost
Is the output cost-efficient? Checks cover token usage relative to output quality, USD per trace, and cost threshold enforcement.
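How the four dimension scores roll up into one number is not specified here; the sketch below assumes a simple weighted average over per-dimension scores normalized to the 0-1 range. The `DimensionScores` container and the weights are illustrative, not part of any documented API.

```python
from dataclasses import dataclass

@dataclass
class DimensionScores:
    """Per-output scores in the 0-1 range, one per eval dimension."""
    completeness: float
    relevance: float
    safety: float
    cost: float

# Illustrative weights; a real deployment would calibrate these.
WEIGHTS = {"completeness": 0.3, "relevance": 0.3, "safety": 0.3, "cost": 0.1}

def output_quality_score(scores: DimensionScores) -> float:
    """Combine the four dimension scores into one composite number."""
    return (
        WEIGHTS["completeness"] * scores.completeness
        + WEIGHTS["relevance"] * scores.relevance
        + WEIGHTS["safety"] * scores.safety
        + WEIGHTS["cost"] * scores.cost
    )

# A detailed answer to the wrong question: strong completeness, weak relevance.
print(output_quality_score(DimensionScores(0.9, 0.3, 1.0, 0.8)))  # 0.74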
Why One Number Matters
Individual eval rules tell you what's wrong. The OQS tells you whether you should care. A response might score 0.9 on completeness but 0.3 on relevance — the OQS captures that it's a detailed answer to the wrong question. It's the signal you monitor on a dashboard, set alerts on, and report to stakeholders.
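Because the OQS is the number you alert on, a minimal monitoring sketch looks like the one below: track a rolling average of recent scores and flag when it drops under a threshold. The threshold and window size are illustrative defaults, not Iris settings.

```python
from collections import deque

def make_oqs_monitor(threshold: float = 0.8, window: int = 50):
    """Track a rolling OQS average; return True when it falls below threshold."""
    recent = deque(maxlen=window)

    def observe(oqs: float) -> bool:
        recent.append(oqs)
        rolling = sum(recent) / len(recent)
        return rolling < threshold  # True means "raise an alert"

    return observe

observe = make_oqs_monitor()
for oqs in (0.91, 0.85, 0.62, 0.55):
    if observe(oqs):
        print(f"OQS alert: rolling average below threshold (latest={oqs})")
```

Alerting on the composite rather than on individual rules keeps the dashboard signal stable: a single noisy dimension rarely trips the alert, but a sustained drop in overall quality does.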
How Iris Helps
Iris scores every output across all four dimensions. The dashboard shows individual rule results and aggregate quality trends. The composite signal makes it easy to spot when overall quality is declining — even when individual dimensions look acceptable in isolation.
Read the deep dive: Output Quality Score →
Related Concepts
Self-Calibrating Eval
Individual dimension thresholds need calibration, which in turn affects the composite OQS.
Eval Coverage
OQS is only meaningful with 100% coverage; a composite score computed on sampled data is misleading.
The Eval Tax
The OQS quantifies what you're losing; low scores show the tax in real time.
Agent Eval
The complete guide to evaluating AI agent outputs.