How I AI · June 15, 2026 · 40m

Braintrust CEO: Evals are the new PRD for AI products

TL;DR

Evals are the new PRD: Goyal says AI shifts programming from defining the how to defining the what, so the core product job becomes writing quantifiable success criteria and examples instead of prose-only specs.
Agents can handle serious systems work, not just CRUD apps: At Braintrust, agents run continuous experiments on database internals like Bloom filters, Tantivy column stores, EC2 to S3 latency, and query patterns across billions of traces over 90 days.
The real quality gain is rigor, not magic code generation: Goyal argues no human staff engineer will manually run as many benchmarks, compare as many algorithms, or keep testing edge cases as relentlessly as an agent guided by strong evals.
Taste scales when you encode it: Braintrust uses evals to capture designer David's judgment, then applies his quality bar across more outputs so his taste matters more, not less.
Maker time matters more in the agent era: Goyal keeps mornings for meetings, afternoons for coding, and runs roughly four to six foreground agents in parallel through tmux sessions, with heavier experiments offloaded to remote compute.
If the agent is flailing, fix the eval and restart: His default recovery move is not to argue with the model but to close the session, improve the scoring setup, and try again from scratch, especially after watching a vibe-coded 3,000-line eval script turn into junk.

The Breakdown

Braintrust CEO Ankur Goyal argues that evals have become the new PRD for AI products, and that coding agents are already good enough to tackle gnarly database and infrastructure work if you define success rigorously. His case is blunt: there is now "no excuse to not have rigor" when an agent can spend days benchmarking Bloom filters, column stores, and latency tradeoffs more exhaustively than any staff engineer would by hand.
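To make "define success rigorously" concrete, here is a minimal sketch of an eval written as a spec: a handful of examples, a scoring function, and a pass threshold. It is plain Python and does not use Braintrust's SDK. Every name in it (`Example`, `exact_match`, `run_eval`, `toy_task`) and the 0.9 threshold are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One spec entry: an input and the output we would accept for it."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if output equals expected (ignoring case and outer whitespace), else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    task: Callable[[str], str],
    examples: Sequence[Example],
    scorer: Callable[[str, str], float],
    threshold: float,
) -> dict:
    """Run the task on every example, score each output, and compare the mean to the threshold."""
    if not examples:
        raise ValueError("an eval needs at least one example")
    scores = [scorer(task(ex.input), ex.expected) for ex in examples]
    mean_score = sum(scores) / len(scores)
    return {"scores": scores, "mean_score": mean_score, "passed": mean_score >= threshold}


def toy_task(text: str) -> str:
    """Hypothetical stand-in for the system under test (a model call or an agent run)."""
    return text.upper()


# The examples plus the threshold play the role of the written requirements.
EXAMPLES = [
    Example(input="ok", expected="OK"),
    Example(input="done", expected="DONE"),
    Example(input="n/a", expected="none"),  # a case the toy task gets wrong
]

result = run_eval(toy_task, EXAMPLES, exact_match, threshold=0.9)
print(result)
```

The toy task fails the third example, so the mean score falls below the threshold and the run is marked as not passing. In this framing, tightening the product spec means adding examples or raising the threshold, not rewriting prose.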