Output Quality Score
One number that tells you if your agent's output is good enough.
Definition
The Output Quality Score (OQS) is a single composite number that aggregates an agent output's scores across four eval dimensions (completeness, relevance, safety, and cost) into one measure of whether the response is good enough to ship.
Four Dimensions
Completeness
Did the agent answer the full question, and is the response structurally complete? Checks cover minimum length, required sections, and response format.
Relevance
Is the response on-topic, and does it address the actual input? Checks cover topic consistency, keyword presence, and semantic alignment.
Safety
Is the output safe to show to users? Checks cover PII detection, prompt injection patterns, hallucination markers, and blocklist enforcement.
Cost
Is the output cost-efficient? Checks cover token usage relative to output quality, USD per trace, and cost threshold enforcement.
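How the four dimension scores roll up into one number is not specified here; the sketch below assumes a simple weighted average over per-dimension scores normalized to the 0-1 range. The `DimensionScores` container and the weights are illustrative, not part of any documented API.

```python
from dataclasses import dataclass

@dataclass
class DimensionScores:
    """Per-output scores in the 0-1 range, one per eval dimension."""
    completeness: float
    relevance: float
    safety: float
    cost: float

# Illustrative weights; a real deployment would calibrate these.
WEIGHTS = {"completeness": 0.3, "relevance": 0.3, "safety": 0.3, "cost": 0.1}

def output_quality_score(scores: DimensionScores) -> float:
    """Combine the four dimension scores into one composite number."""
    return (
        WEIGHTS["completeness"] * scores.completeness
        + WEIGHTS["relevance"] * scores.relevance
        + WEIGHTS["safety"] * scores.safety
        + WEIGHTS["cost"] * scores.cost
    )

# A detailed answer to the wrong question: strong completeness, weak relevance.
print(output_quality_score(DimensionScores(0.9, 0.3, 1.0, 0.8)))  # 0.74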
Why One Number Matters
Individual eval rules tell you what's wrong. The OQS tells you whether you should care. A response might score 0.9 on completeness but 0.3 on relevance — the OQS captures that it's a detailed answer to the wrong question. It's the signal you monitor on a dashboard, set alerts on, and report to stakeholders.
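Because the OQS is the number you alert on, a minimal monitoring sketch looks like the one below: track a rolling average of recent scores and flag when it drops under a threshold. The threshold and window size are illustrative defaults, not Iris settings.

```python
from collections import deque

def make_oqs_monitor(threshold: float = 0.8, window: int = 50):
    """Track a rolling OQS average; return True when it falls below threshold."""
    recent = deque(maxlen=window)

    def observe(oqs: float) -> bool:
        recent.append(oqs)
        rolling = sum(recent) / len(recent)
        return rolling < threshold  # True means "raise an alert"

    return observe

observe = make_oqs_monitor()
for oqs in (0.91, 0.85, 0.62, 0.55):
    if observe(oqs):
        print(f"OQS alert: rolling average below threshold (latest={oqs})")
```

Alerting on the composite rather than on individual rules keeps the dashboard signal stable: a single noisy dimension rarely trips the alert, but a sustained drop in overall quality does.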
How Iris Helps
Iris scores every output across all four dimensions. The dashboard shows individual rule results and aggregate quality trends. The composite signal makes it easy to spot when overall quality is declining — even when individual dimensions look acceptable in isolation.
Read the deep dive: Output Quality Score →
Related Concepts
Self-Calibrating Eval
Individual dimension thresholds need calibration, which in turn affects the composite OQS.
Eval Coverage
OQS is only meaningful with 100% coverage; a composite score computed on sampled data is misleading.
The Eval Tax
The OQS quantifies what you're losing; low scores show the tax in real time.
Agent Eval
The complete guide to evaluating AI agent outputs.