Blog
Original research on MCP agent observability, evaluation methodology, and the evolving landscape of AI agent infrastructure.
I have spent most of my career trusting error trackers. A TypeError fires, Sentry catches it, I get a Slack notification with a stack trace and breadcrumbs, and...
There are two worlds in production observability right now, and they do not talk to each other.
The Model Context Protocol defines how agents discover and invoke tools. It defines resources, prompts, and transport mechanisms. It standardizes the interface ...
Here is the default approach to evaluating agent output in 2026: take the output, send it to another LLM, ask that LLM to judge quality, and trust the result.
In 2010, application performance monitoring was a nice-to-have. Engineering teams shipped to production, watched their server logs, and hoped for the best. Moni...
There is a default assumption in the agent eval space right now: if you want to evaluate agent output, you need an LLM to judge it. Feed the output to GPT-4o wi...
Last month I got a message from a developer running a research agent in production. His APM dashboard looked fine. HTTP 200s across the board. P99 latency under...
There is a sentence I keep coming back to. I first saw it from @aginaut on X:
"The gap between deploying AI agents and understanding what they're doing."
You shipped an AI agent. It works... sometimes. A user reports a wrong answer. Another says it took 40 seconds. A third notices it leaked an email address in it...