AI Engineer · May 27, 2026 · 18m

The maturity phases of running evals — Phil Hetzel, Braintrust

TL;DR

Start with vibes, but document the reasons — Hetzel says early evals can absolutely begin as manual “vibe checks,” as long as a human annotator or SME records not just thumbs up/down but the justification behind each judgment.
Evals are about failure modes, not exhaustive coverage — Unlike unit tests, agent evals should target the most important ways an agent can fail. The space of possible failures is effectively infinite, so trying to enumerate all of it kills shipping velocity.
Production traces are the gold-standard eval dataset — Hetzel argues teams should stop thinking of evals as synthetic tests and instead “rerun production,” pulling in real traces or at least UAT-level interactions to measure quality against actual usage.
LLM-as-judge is useful, but it also needs evals — Braintrust uses LLM judges to scale human expertise, but Hetzel warns that “putting a robe and cloak on an LLM” does not make it trustworthy; you still need ground-truth datasets and validation against human decisions.
Tool-using agents force you to evaluate whole traces, not just outputs — Once agents call APIs, databases, MCP servers, or CRUD systems, the evaluation problem expands to system state, tool-call behavior, token and cost constraints, and whether offline replay can safely simulate the original environment.
The next frontier is automatic failure discovery — Hetzel points to topic modeling over production traces and CLI-driven automated eval workflows as emerging patterns for finding new failure modes and operationalizing evals continuously.
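
The first and fourth points above can be sketched together: store each human verdict with its justification, then measure how often a judge agrees with those verdicts before trusting it. This is a minimal illustration, not Braintrust's API. The names `Annotation`, `judge_agreement`, and `keyword_judge` are hypothetical, and `keyword_judge` is a deliberately crude stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Annotation:
    """One human judgment on one agent output (hypothetical schema)."""
    trace_id: str
    output: str
    passed: bool   # thumbs up / thumbs down
    reason: str    # the justification behind the verdict


def judge_agreement(
    annotations: list[Annotation],
    judge: Callable[[str], bool],
) -> tuple[float, list[Annotation]]:
    """Return the judge's agreement rate with humans and the cases it got wrong."""
    if not annotations:
        raise ValueError("need at least one human annotation")
    misses = [a for a in annotations if judge(a.output) != a.passed]
    rate = (len(annotations) - len(misses)) / len(annotations)
    return rate, misses


def keyword_judge(output: str) -> bool:
    # Assumption: placeholder for an LLM judge; a real one would call a model.
    return "refund" in output.lower()


annotations = [
    Annotation("t1", "Your refund was issued.", True, "Confirms the refund clearly."),
    Annotation("t2", "I cannot help with that.", False, "Refuses a valid request."),
    Annotation("t3", "Refund policy: see website.", False, "Deflects instead of acting."),
    Annotation("t4", "Done, the money is on its way back.", True, "Correct without saying 'refund'."),
]

rate, misses = judge_agreement(annotations, keyword_judge)
print(f"agreement: {rate:.0%}")
for a in misses:
    # The recorded reason explains what the judge failed to capture.
    print(a.trace_id, "->", a.reason)
```

The disagreements are the useful part: each one carries the annotator's reason, which is the material a team would use to revise the judge's prompt or criteria.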

The Breakdown

“Think about evals like rerunning production” is Phil Hetzel’s core advice: the path from vibe checks to mature agent evals starts with human thumbs-up/thumbs-down judgments, then scales into LLM judges, trace-level analysis, and production-derived datasets. His bigger point is that evals and observability are really the same system viewed at different times—before launch to gain confidence, and after launch to keep it.
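
Trace-level analysis over production-derived data might look like the following sketch, which checks a recorded trace for tool-call behavior and a token budget instead of scoring only the final output. The `Trace` and `ToolCall` shapes, the `check_trace` function, and the tool names are all assumptions made for illustration; real tracing formats differ.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Trace:
    """A recorded agent run (hypothetical shape for this example)."""
    trace_id: str
    tool_calls: list[ToolCall]
    total_tokens: int
    final_output: str


def check_trace(
    trace: Trace,
    allowed_tools: set[str],
    required_tools: list[str],
    max_tokens: int,
) -> list[str]:
    """Return a list of failure descriptions; an empty list means the trace passed."""
    failures = []
    called = [call.name for call in trace.tool_calls]
    for name in called:
        if name not in allowed_tools:
            failures.append(f"unexpected tool call: {name}")
    for name in required_tools:
        if name not in called:
            failures.append(f"missing tool call: {name}")
    if trace.total_tokens > max_tokens:
        failures.append(f"token budget exceeded: {trace.total_tokens} > {max_tokens}")
    if not trace.final_output.strip():
        failures.append("empty final output")
    return failures


# A trace as it might be pulled from production logs (made-up data).
bad = Trace(
    trace_id="prod-001",
    tool_calls=[ToolCall("lookup_order", {"id": 7}), ToolCall("delete_account", {"id": 7})],
    total_tokens=5200,
    final_output="Your account has been updated.",
)
good = Trace(
    trace_id="prod-002",
    tool_calls=[ToolCall("lookup_order", {"id": 8}), ToolCall("issue_refund", {"id": 8})],
    total_tokens=1800,
    final_output="Your refund has been issued.",
)

allowed = {"lookup_order", "issue_refund"}
failures = check_trace(bad, allowed, ["issue_refund"], max_tokens=4000)
for f in failures:
    print(f)
```

Because the same check runs on a stored trace or a live one, it fits the framing of evals and observability as one system: use it on replayed traces before launch and on incoming traces afterward.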