AI Engineer · 20m

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

TL;DR

  • AI benchmarks are fragmented and go stale almost immediately — Nicholas Kang says 10+ new benchmarks can drop in a day, many live in papers, and their leaderboards stop mattering once authors move on.

  • Published eval results can be misleading because setups aren’t truly comparable — Kang recounts an AI lab rerunning a benchmark with API-side optimization like prompt compaction and posting much stronger scores, showing how hidden configuration choices change outcomes.

  • Open benchmark creation matters because frontier labs miss huge parts of the real world — a wastewater treatment engineer in Turkey built a safety benchmark from 20 years of field experience after deadly protocol failures, creating data no lab would have had on hand.

  • Kaggle is testing a consumer-facing 'agent exam' so anyone can score an agent with one prompt — the MVP launched a week earlier and already logged 500+ evaluated agents, plus notebook spin-offs and even an 'exam prep course.'

  • Game Arena uses model-vs-model games to avoid benchmark saturation, but the costs are brutal — Michael Aaron says poker alone required about 400,000 hands for statistical significance, turning eval design into a billing and methodology problem.

  • For coding agents, the harness may matter more than the model — citing Morph’s March 16 post on SWE-bench Pro, Aaron notes six frontier models were within a few points of each other while harness choice could swing results by 22%.
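The Game Arena point above, that poker needed about 400,000 hands for statistical significance, follows from how noisy per-hand results are. The sketch below is not the team's methodology; it is a textbook sample-size estimate for a mean, under assumed inputs. The function name `hands_needed`, the per-hand standard deviation of 10 big blinds, and the target edge of 0.03 big blinds per hand are all hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def hands_needed(std_per_hand: float, edge_per_hand: float,
                 confidence: float = 0.95) -> int:
    """Estimate hands required to resolve a given win-rate edge.

    Uses the normal approximation n = (z * sigma / delta)^2, where the
    two-sided confidence interval half-width on the mean win rate must be
    no larger than the edge we want to detect. All inputs are assumptions
    supplied by the caller, not figures from the talk.
    """
    if std_per_hand <= 0 or edge_per_hand <= 0:
        raise ValueError("std_per_hand and edge_per_hand must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # z-score for a two-sided interval at the requested confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * std_per_hand / edge_per_hand) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 10 big blinds of per-hand standard deviation,
    # and a 0.03 big-blind-per-hand edge between two models.
    print(hands_needed(std_per_hand=10.0, edge_per_hand=0.03))
```

With these assumed inputs the estimate lands in the hundreds of thousands of hands, the same order of magnitude as the figure Aaron quotes. Because the required count scales with the inverse square of the edge, halving the gap you want to detect roughly quadruples the hands, and the bill.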

The Breakdown

A Google DeepMind/Kaggle team says AI evals are “kind of broken” because benchmarks are scattered, stale, and easy to game — and they’re trying to fix that with open hackathons, standardized agent exams, PvP game arenas, and a community benchmark platform. The sharpest reveal: they’re already seeing 500+ agents take their new exam in a week, while one benchmark creator used the system to test AI on wastewater safety after real-world incidents in Turkey killed people.
