AI Engineer | June 4, 2026 | 23m

The Art & Science of Benchmarking Agents — Vincent Chen, Snorkel AI

TL;DR

The eval gap is now bigger than the capability gap: Chen argues that enterprises in finance, insurance, and healthcare hesitate not because agents are useless, but because measurement has not kept up with what the models can already do.
Good benchmarks start with obsessive task quality: He praises GPQA's multi-reviewer, adversarial quality control process, including expert adjudication, revision loops, and payout incentives tied to agreement.
Distribution and headroom matter as much as raw score: MMLU worked because its 57-domain taxonomy was intentional, and ARC-AGI stayed valuable because it remained unsaturated; ARC-AGI-3 still launched with frontier models scoring under 1 percent.
Robust evals should measure the thing that actually matters in practice: Tau-Bench is his example because it scores multi-turn agents not just on task completion but also on policy adherence, so booking the right flight still fails if it breaks fare rules.
The best benchmarks make a directional bet on the field: Terminal Bench bet early that the CLI would become a core interface for general-purpose agents, and Chen says that thesis now looks prescient given Claude, Codex, and enterprise agent workflows.
Researcher UX is an underrated adoption driver: Benchmarks like HELM and Terminal Bench 2.0 with Harbor succeeded in part because they gave researchers a modular harness, easy model runs, and a practical path to extending tasks and training loops.
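
The policy-adherence point can be made concrete with a small sketch. This is not Tau-Bench's actual harness or API; the `EpisodeResult` schema, the `episode_passes` and `pass_rate` helpers, and the sample episodes are all hypothetical, chosen only to illustrate how a score that also requires policy adherence can diverge from one that counts task completion alone.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    """Outcome of one multi-turn agent episode (hypothetical schema)."""
    task_completed: bool
    policy_violations: list[str] = field(default_factory=list)


def episode_passes(result: EpisodeResult) -> bool:
    # Completing the task is necessary but not sufficient:
    # any recorded policy violation fails the episode.
    return result.task_completed and not result.policy_violations


def pass_rate(results: list[EpisodeResult]) -> float:
    # Fraction of episodes that complete the task AND respect policy.
    if not results:
        return 0.0
    return sum(episode_passes(r) for r in results) / len(results)


# Illustrative data only, not real benchmark results.
episodes = [
    EpisodeResult(task_completed=True),
    EpisodeResult(task_completed=True, policy_violations=["broke fare rules"]),
    EpisodeResult(task_completed=False),
]

completion_only = sum(r.task_completed for r in episodes) / len(episodes)
strict = pass_rate(episodes)
print(f"completion-only: {completion_only:.2f}, with policy check: {strict:.2f}")
```

In this toy data the second episode books the right flight but breaks a fare rule, so it counts toward the completion-only score and fails the stricter one.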

The Breakdown

$3 million and 120-plus applications later, Vincent Chen says the real bottleneck for agents is not raw capability but our ability to measure them in high-stakes settings. His framework for benchmarks is blunt: great ones need rigorous task quality, intentional distributions, real headroom, robust evals, a clear thesis about the future, and excellent researcher UX.