AI Engineer · May 25, 2026 · 20m

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

TL;DR

AI benchmarks are fragmented and go stale almost immediately — Nicholas Kang says 10+ new benchmarks can drop in a day, many live in papers, and their leaderboards stop mattering once authors move on.
Published eval results can be misleading because setups aren’t truly comparable — Kang recounts an AI lab rerunning a benchmark with API-side optimization like prompt compaction and posting much stronger scores, showing how hidden configuration choices change outcomes.
Open benchmark creation matters because frontier labs miss huge parts of the real world — a wastewater treatment engineer in Turkey built a safety benchmark from 20 years of field experience after deadly protocol failures, creating data no lab would have had on hand.
Kaggle is testing a consumer-facing 'agent exam' so anyone can score an agent with one prompt — the MVP launched a week earlier and already logged 500+ evaluated agents, plus notebook spin-offs and even an 'exam prep course.'
Game Arena uses model-vs-model games to avoid benchmark saturation, but the costs are brutal — Michael Aaron says poker alone required about 400,000 hands for statistical significance, turning eval design into a billing and methodology problem.
For coding agents, the harness may matter more than the model — citing Morph’s March 16 post on SWE-bench Pro, Aaron notes six frontier models were within a few points of each other while harness choice could swing results by 22%.
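To make the poker cost point concrete, here is a rough sketch of why head-to-head game evals need so many hands. It is not the Game Arena methodology, which the talk summary does not describe. It is a textbook sample-size calculation under assumed numbers: a per-100-hand standard deviation of 100 big blinds and a true edge of 5 big blinds per 100 hands are illustrative guesses, and `hands_needed` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def hands_needed(edge_bb_per_100: float,
                 sd_bb_per_100: float,
                 alpha: float = 0.05,
                 power: float = 0.8) -> int:
    """Approximate hands required to detect a win-rate edge against zero.

    Treats each block of 100 hands as one independent sample and applies
    the standard two-sided z-test sample-size formula. All inputs are
    assumptions chosen for illustration, not figures from the talk.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    blocks = ((z_alpha + z_beta) * sd_bb_per_100 / edge_bb_per_100) ** 2
    return math.ceil(blocks) * 100


if __name__ == "__main__":
    # Assumed: 5 bb/100 edge, 100 bb/100 standard deviation.
    print(hands_needed(5, 100))
    # Halving the edge roughly quadruples the hands required.
    print(hands_needed(2.5, 100))
```

Under these assumptions a single matchup already needs hundreds of thousands of hands. Because the requirement grows with the inverse square of the edge, closely matched models are the most expensive ones to separate.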

The Breakdown

A Google DeepMind/Kaggle team says AI evals are “kind of broken” because benchmarks are scattered, stale, and easy to game, and they’re trying to fix that with open hackathons, standardized agent exams, PvP game arenas, and a community benchmark platform. The sharpest reveal: more than 500 agents took the new exam in its first week, while one benchmark creator used the system to test AI on wastewater safety after deadly protocol failures in Turkey.