AI Engineer · May 28, 2026 · 20m

How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust

TL;DR

Agent observability is not just Datadog for LLM apps — Phil Hetzel says traditional observability asks whether a system is up, fast, and error-free, while agent observability also has to judge qualitative behavior like grounding, tool use, and brand alignment.
Non-determinism changes what you have to observe — classic apps follow known code paths, but agents can take many valid routes, so teams need broader visibility into why one reasoning path happened instead of another.
The trace shape is the real systems challenge — Braintrust has seen agent traces exceed 1 GB and individual spans hit 20 MB, which makes ingestion, indexing, and real-time querying far harder than standard observability pipelines.
Text search becomes a first-class observability feature — Hetzel highlights Braintrust’s use of a Tantivy-based full-text index so teams can query traces for terms like 'Amazon,' something traditional observability stacks rarely need to support deeply.
The best agent observability users are not just engineers — Braintrust sees clinicians, registered nurses, lawyers, and wealth advisors reviewing traces and improving agents because prompt-based systems let domain experts contribute directly in natural language.
Observability and evals are basically the same system run in different modes — Hetzel’s framing is that evals are just batch observability with known inputs, while production observability is the same quality problem in real time with unknown inputs.
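
To make the "same system, different modes" framing concrete, here is a minimal sketch in which one scorer serves both an offline eval and production scoring. It is an illustration only, not Braintrust's API: `Record`, `grounded_score`, `run_eval`, and `score_production` are hypothetical names, and the word-overlap scorer is a toy stand-in for a real grounding check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Record:
    """One agent interaction: what went in and what came out (hypothetical shape)."""
    input: str
    output: str


def grounded_score(record: Record, context: str) -> float:
    """Toy scorer: fraction of output words that also appear in the context."""
    words = record.output.lower().split()
    if not words:
        return 0.0
    context_words = set(context.lower().split())
    return sum(w in context_words for w in words) / len(words)


def run_eval(dataset: Iterable[str], agent: Callable[[str], str], context: str) -> List[float]:
    """Batch mode: inputs are known up front, so we run the agent and score each output."""
    return [grounded_score(Record(q, agent(q)), context) for q in dataset]


def score_production(records: Iterable[Record], context: str) -> List[float]:
    """Online mode: inputs arrive from real users, so we score outputs that were already logged."""
    return [grounded_score(r, context) for r in records]
```

The only difference between the two entry points is where the inputs come from; the quality measurement itself is shared.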

The Breakdown

Agent traces can be over a gigabyte each, packed with semi-structured text, tool calls, and reasoning paths — which is why Phil Hetzel argues agent observability is a fundamentally different systems problem from traditional uptime monitoring. His core point: once software becomes non-deterministic, observability has to measure not just latency and errors, but whether the agent was grounded, on-brand, and actually useful.
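
As a rough illustration of the trace shape described above, the sketch below models a trace as a list of spans carrying semi-structured payloads, then measures serialized size and runs a naive term search. The `Span` and `Trace` classes are assumptions made for this example, not a real schema, and the linear scan is deliberately simplistic: it is the kind of lookup that a full-text index such as the Tantivy-based one mentioned in the talk is meant to replace at scale.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One step in an agent run, e.g. an LLM call or a tool call (hypothetical shape)."""
    name: str
    payload: dict  # semi-structured: prompts, tool arguments, model output

    def size_bytes(self) -> int:
        # Size of the payload once serialized to JSON and encoded as UTF-8.
        return len(json.dumps(self.payload).encode("utf-8"))


@dataclass
class Trace:
    spans: List[Span] = field(default_factory=list)

    def size_bytes(self) -> int:
        return sum(span.size_bytes() for span in self.spans)

    def search(self, term: str) -> List[str]:
        """Naive case-insensitive scan over every span's serialized payload."""
        needle = term.lower()
        return [s.name for s in self.spans if needle in json.dumps(s.payload).lower()]


trace = Trace([
    Span("llm_call", {"output": "You could try Amazon for that."}),
    Span("tool_call", {"tool": "search", "args": {"q": "running shoes"}}),
])
matches = trace.search("amazon")  # names of spans whose payload mentions the term
```

Scanning every payload like this costs time proportional to the total trace size, which is why it stops being practical once a single span can reach tens of megabytes.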