AI Engineer · June 6, 2026 · 19m

Evals Are Broken, Use Them Anyway — Ara Khan, Cline

TL;DR

Most teams are wrong about evals in two opposite ways: Khan says one camp treats benchmark dashboards like objective truth, while the other relies on pure taste and vibes, and both miss that evals are useful only as imperfect approximations.
Do not trust model-maker benchmark claims at face value: she points to vendor benchmark maxing, including Meta-style leaderboard claims, and argues models with similar scores can feel radically different in real-world use.
Stay current, but do not be the first adopter: frontier models now change every few months, so her advice is to wait a couple of weeks, let the hype settle, and switch only if the model still holds up.
Old evals stop measuring frontier ability surprisingly fast: Khan cites OpenAI saying SWE-bench Verified no longer captures frontier coding capability, because tasks like Fibonacci or matrix multiplication do not reflect real software engineering work.
Good agent evals look like real jobs, not one-shot trivia: at Cline, the team built coding evals from opted-in user data and used tools like Stanford’s Terminal-Bench. Its 89 tasks can take 30 to 40 minutes each, because agents must inspect files, run scripts, and avoid breaking other things.
The point of evals is targeted improvement, not bragging rights: Cline used failure traces to find issues in models, harnesses, and task design, raising a score from around 43 percent by fixing bugs, adjusting CPU and memory, tuning timeouts, and refining model-specific prompting without overfitting.

The Breakdown

Ara Khan’s core claim is blunt: benchmark numbers are often a hoax, but teams still need evals because they are one of the few practical ways to systematically improve agents. Her fix is a middle path between leaderboard worship and pure vibes, using fresh, realistic tasks to hill-climb on actual product failures.
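
The failure-trace workflow from the TL;DR (attributing each failed task to the model, the harness, or the task design before fixing anything) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cline's tooling: the TaskResult type, the failure_cause labels, and the sample results are all hypothetical, and the labels are assumed to come from a person or script reading each trace.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    task_id: str
    passed: bool
    # Hypothetical label assigned after reading the failure trace:
    # "model", "harness", or "task_design". None for passing tasks.
    failure_cause: Optional[str] = None


def pass_rate(results):
    """Fraction of tasks that passed (0.0 for an empty run)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def failure_breakdown(results):
    """Count failed tasks by labeled cause, so the largest bucket gets fixed first."""
    return Counter(r.failure_cause or "unlabeled" for r in results if not r.passed)


# Illustrative data only; these are not results from the talk.
results = [
    TaskResult("task-1", True),
    TaskResult("task-2", False, "harness"),
    TaskResult("task-3", False, "model"),
    TaskResult("task-4", False, "harness"),
    TaskResult("task-5", True),
]

print(f"pass rate: {pass_rate(results):.0%}")
for cause, count in failure_breakdown(results).most_common():
    print(f"{cause}: {count}")
```

Separating harness and task-design failures from model failures matters because only the last group says anything about the model; fixing the first two raises the score without tuning prompts toward specific tasks.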