AI Engineer · 25m

Why building eval platforms is hard — Phil Hetzel, Braintrust

TL;DR

  • Evals start in spreadsheets, but that’s just documentation — Phil Hetzel says the classic “Google Sheet + for loop” setup is a valid starting point, but it quickly breaks down when you need experiment comparison, analytics, and collaboration beyond a lone engineer.

  • The hard part of eval platforms isn’t the UI — it’s the data layer — Hetzel jokes that anyone can “vibe code” an eval UI, but the real challenge is handling huge, messy, high-velocity agent traces with low-latency reads, aggregate analysis, and full-text search.

  • Observability and eval are really one flywheel — Braintrust originally focused on offline evals, then realized customers were piping hourly production traffic into evals anyway, which pushed them to treat tracing, logging, online evals, and offline improvement as one continuous loop.

  • Good evals are a multi-persona workflow, not an engineer-only task — Product engineers, AI engineers, subject-matter experts, and nontechnical reviewers all need to participate, because the people closest to users often surface the most important failure modes.

  • LLM traces are fundamentally weirder than normal app telemetry — Hetzel says traditional spans might be a few kilobytes, while LLM spans can hit 10–20 MB, packed with unstructured text and multimodal artifacts, which is why normal observability stacks struggle.

  • The next generation of eval platforms will serve agents as much as humans — Hetzel points to headless use cases where people want Codex or Claude Code to query eval data directly in SQL, discover “unknown unknowns” via topic modeling, and improve agents without relying on a UI.

The Breakdown

A packed room for the least glamorous AI topic

Phil Hetzel opens with a professor anecdote: he used to expect 130 students and end up teaching 10, so seeing a packed room for “evals” feels like a small miracle. He introduces himself as Braintrust’s head of solutions engineering — the person who sees how customers actually get eval and observability systems into production.

Why he joined Braintrust after watching POCs stall out

Before Braintrust, Hetzel spent 12 years in consulting at KPMG and Slalom, where he led Slalom’s global Databricks business. The pattern he saw was brutal: clients were excellent at spinning up generative AI proofs of concept, but almost none of them made it to production, which is what pushed him toward eval infrastructure in the first place.

The spreadsheet phase is real — and respectable

He asks who’s doing evals in Google Sheets and makes a point of saying there’s “no shame” in it. That setup — inputs, a loop to run the agent, and handwritten notes or scores — matters because it acknowledges the problem, but he says it’s more like documenting behavior than truly experimenting.
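For reference, here is a minimal sketch of that spreadsheet-era loop. The CSV layout and the `run_agent` stand-in are purely illustrative assumptions, not anything shown in the talk:

```python
import csv

def run_agent(question: str) -> str:
    """Stand-in for the agent under test, e.g. a chat-completion call."""
    return "TODO: call your model or agent here"

# Eval cases exported from the sheet: one input per row, with an
# optional hand-written "expected" answer.
with open("eval_cases.csv", newline="") as f:
    cases = list(csv.DictReader(f))

# The classic "for loop": run every case and write the outputs back out
# so a human can fill in scores by hand -- documentation, not experimentation.
with open("eval_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected", "output", "score"])
    writer.writeheader()
    for case in cases:
        writer.writerow({
            "input": case["input"],
            "expected": case.get("expected", ""),
            "output": run_agent(case["input"]),
            "score": "",  # scored manually back in the spreadsheet
        })
```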

Why the iceberg is so much bigger than people expect

The simple version of evals sounds easy enough that, in his words, it would make for a very short talk. But once teams get serious, they discover the iceberg: versioning, analytics, collaboration, scoring, persistence, production feedback loops, and the fact that evals are a multi-persona problem involving engineers, domain experts, and product people — not just one builder working alone.

From vibe-coded UI to actual experimentation

Hetzel has some fun with the engineer who “puffs their chest out” and says, “I can just vibe code this in a branch.” He thinks that’s a reasonable next step after spreadsheets: build a nicer UI, use a real database, and make evals accessible to more people — but he warns that this still often becomes a reporting tool unless you add a real playground where users can tweak prompts or configs and compare runs side by side.
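A rough sketch of the difference, assuming a toy exact-match scorer and hypothetical prompt variants (none of this is from the talk): the same cases run under two configurations so the results can be compared side by side rather than just reported.

```python
from statistics import mean

def run_agent(system_prompt: str, question: str) -> str:
    """Stand-in for the agent under test, parameterized by its system prompt."""
    return "TODO: call your model or agent here"

def exact_match(output: str, expected: str) -> float:
    """Toy scorer: 1.0 if the output matches the expected answer exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0

cases = [
    {"input": "What is the refund window?", "expected": "30 days"},
    # ... more cases, ideally drawn from real production failures
]

variants = {
    "baseline": "You are a helpful support agent.",
    "candidate": "You are a helpful support agent. Cite the policy you relied on.",
}

# Run every variant over the same cases so the comparison is apples to apples.
results = {}
for name, system_prompt in variants.items():
    scores = [exact_match(run_agent(system_prompt, c["input"]), c["expected"]) for c in cases]
    results[name] = mean(scores)

# The side-by-side summary a pure reporting UI never gives you.
for name, score in results.items():
    print(f"{name:>10}: {score:.2%}")
```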

The key shift: evals need production traces

This is where, as he puts it, “the rubber starts meeting the road.” The best evals come from understanding actual failure modes, and the best way to find those is production trace data from real users, which is why Braintrust started treating observability and eval as one system rather than two products.
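One way to picture that loop is turning flagged production traces into offline eval cases. This sketch assumes a simple JSONL trace log with per-trace user feedback; the field names are illustrative, not Braintrust's actual schema:

```python
import json

def trace_to_eval_case(trace: dict) -> dict:
    """Turn one logged production trace into an offline eval case."""
    return {
        "input": trace["input"],
        # The failing production output is what the next experiment must beat;
        # a subject-matter expert can later add a corrected "expected" answer.
        "baseline_output": trace["output"],
        "expected": None,
        "source_trace_id": trace["id"],
    }

# Read production traces (one JSON object per line) and keep the ones real
# users flagged -- those are the failure modes worth evaluating against.
with open("production_traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

failures = [t for t in traces if t.get("user_feedback") == "thumbs_down"]

with open("eval_dataset.jsonl", "w") as f:
    for case in (trace_to_eval_case(t) for t in failures):
        f.write(json.dumps(case) + "\n")
```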

The nasty systems problem hiding underneath

Once you ingest production traffic, you’re no longer just building eval software — you’re building a tracing and logging platform. Hetzel says agent traces are “really nasty”: semi-structured or unstructured, full of text, high-volume, and sometimes so large that a single span can be 10–20 megabytes, making a normal Postgres-centric architecture buckle.
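To make the contrast concrete, here is an entirely illustrative comparison of what tends to live inside a traditional span versus an agent span (no particular tracing schema is implied):

```python
# A traditional observability span: a handful of short, structured fields.
http_span = {
    "name": "GET /checkout",
    "duration_ms": 42,
    "status": 200,
}  # typically a few hundred bytes

# An LLM agent span: full prompts, retrieved documents, tool outputs, and
# sometimes base64-encoded images or audio -- easily megabytes per span.
llm_span = {
    "name": "support_agent.turn",
    "input": {"messages": ["<full system prompt>", "<entire conversation so far>"]},
    "retrieved_context": ["<dozens of document chunks>"],
    "tool_calls": [{"name": "search_orders", "result": "<large JSON payload>"}],
    "output": "<long free-text answer>",
    "attachments": ["<base64-encoded screenshot>"],
}
```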

What comes next: unknown unknowns and agents using the platform too

He closes by arguing that mature eval platforms should surface the “unknown unknowns” for teams through topic modeling instead of forcing humans to manually comb through traces. And increasingly, the platform isn’t just for people: teams want headless, SQL-friendly backends that coding agents like Codex or Claude Code can query directly, plus all the enterprise plumbing he didn’t even get to in the talk, such as RBAC, data masking, and automatic tracing through gateways.
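As a toy illustration of that headless pattern, here is what agent-driven SQL over a local trace log might look like. DuckDB and the column names are stand-ins for whatever backend a team actually exposes, not a description of how Braintrust works:

```python
import duckdb

# A coding agent doesn't need a dashboard: it can query the trace data
# directly in SQL and act on the answer. Column names here are assumed.
query = """
    SELECT
        model,
        count(*) AS calls,
        avg(duration_ms) AS avg_latency_ms,
        sum(CASE WHEN user_feedback = 'thumbs_down' THEN 1 ELSE 0 END) AS failures
    FROM read_json_auto('production_traces.jsonl')
    GROUP BY model
    ORDER BY failures DESC
"""
duckdb.sql(query).show()
```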