How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
TL;DR
Agent observability is not just Datadog for LLM apps — Phil Hetzel says traditional observability asks whether a system is up, fast, and error-free, while agent observability also has to judge qualitative behavior like grounding, tool use, and brand alignment.
Non-determinism changes what you have to observe — classic apps follow known code paths, but agents can take many valid routes, so teams need broader visibility into why one reasoning path happened instead of another.
The trace shape is the real systems challenge — Braintrust has seen agent traces exceed 1 GB and individual spans hit 20 MB, which makes ingestion, indexing, and real-time querying far harder than standard observability pipelines.
Text search becomes a first-class observability feature — Hetzel highlights Braintrust’s use of a Tantivy-based full-text index so teams can query traces for terms like 'Amazon,' something traditional observability stacks rarely need to support deeply.
The best agent observability users are not just engineers — Braintrust sees clinicians, registered nurses, lawyers, and wealth advisors reviewing traces and improving agents because prompt-based systems let domain experts contribute directly in natural language.
Observability and evals are basically the same system run in different modes — Hetzel’s framing is that evals are just batch observability with known inputs, while production observability is the same quality problem in real time with unknown inputs.
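The last point, that evals and production observability are one system run in two modes, can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: the `Record` type, the word-overlap `grounding_score` heuristic, and the `run_eval` / `score_live` helpers are assumptions, not Braintrust's API or Hetzel's method. The idea it shows is that one scorer can be reused unchanged, once over a dataset of known inputs and once over a stream of production records.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class Record:
    input: str
    output: str
    context: str  # reference text the answer is supposed to be grounded in


def grounding_score(rec: Record) -> float:
    """Toy heuristic: fraction of output words that also appear in the context."""
    out_words = rec.output.lower().split()
    if not out_words:
        return 0.0
    ctx_words = set(rec.context.lower().split())
    return sum(w in ctx_words for w in out_words) / len(out_words)


def run_eval(
    dataset: List[Tuple[str, str]],
    agent: Callable[[str, str], str],
    scorer: Callable[[Record], float],
) -> float:
    """Batch mode: run the agent on known (input, context) pairs, return the mean score."""
    scores = [scorer(Record(inp, agent(inp, ctx), ctx)) for inp, ctx in dataset]
    return sum(scores) / len(scores)


def score_live(
    stream: Iterable[Record],
    scorer: Callable[[Record], float],
    threshold: float,
) -> Iterator[Record]:
    """Online mode: apply the same scorer to production records, yield the low scorers."""
    for rec in stream:
        if scorer(rec) < threshold:
            yield rec
```

In practice the scorer would more likely be an LLM judge or a human review step than a word-overlap count. The sketch only illustrates that the quality check itself does not change between the two modes; what changes is whether the inputs are known in advance.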
The Breakdown
Agent traces can be over a gigabyte each, packed with semi-structured text, tool calls, and reasoning paths — which is why Phil Hetzel argues agent observability is a fundamentally different systems problem from traditional uptime monitoring. His core point: once software becomes non-deterministic, observability has to measure not just latency and errors, but whether the agent was grounded, on-brand, and actually useful.
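To make the trace-shape and text-search points tangible, here is a minimal sketch of a trace as a list of spans with free-form payloads, plus a toy in-memory inverted index. The `Span` and `Trace` types, the field names, and the `build_index` / `search` helpers are all assumptions for illustration. This is not Braintrust's data model and not Tantivy's API. A real full-text index does the same job at far larger scale and on disk.

```python
import json
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Span:
    span_id: str
    kind: str       # hypothetical labels such as "llm" or "tool"
    payload: dict   # semi-structured content: prompts, tool arguments, outputs


@dataclass
class Trace:
    trace_id: str
    spans: List[Span] = field(default_factory=list)


def span_bytes(span: Span) -> int:
    """Serialized size of one span's payload in bytes (UTF-8 encoded JSON)."""
    return len(json.dumps(span.payload, ensure_ascii=False).encode("utf-8"))


def trace_bytes(trace: Trace) -> int:
    """Total payload size of a trace, summed over its spans."""
    return sum(span_bytes(s) for s in trace.spans)


def build_index(traces: List[Trace]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each lowercase token to the (trace_id, span_id) pairs that contain it.

    Tokens come from the serialized payload, so JSON keys are indexed as well.
    """
    index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for trace in traces:
        for span in trace.spans:
            text = json.dumps(span.payload, ensure_ascii=False).lower()
            for token in re.findall(r"[a-z0-9]+", text):
                index[token].add((trace.trace_id, span.span_id))
    return index


def search(index: Dict[str, Set[Tuple[str, str]]], term: str) -> Set[Tuple[str, str]]:
    """Case-insensitive single-term lookup."""
    return index.get(term.lower(), set())
```

With this shape, a query like `search(index, "Amazon")` returns the spans whose payload mentions the term. The approach stops scaling long before a 20 MB span or a 1 GB trace, which is the systems problem the talk describes.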