AI code benchmarks lied to us
TL;DR
DeepSWE blows up the old leaderboard: DataCurve's new benchmark puts GPT-5 at 70%, GPT-4 at 56%, Opus 4.7 at 54%, and then drops hard to Sonnet 4.6 at 32%, creating a much wider and more believable spread than SWE-bench Pro.
SWE-bench Pro is contaminated and misgraded: Theo highlights DataCurve's audit showing roughly 8% false positives, 24% false negatives, and many runs where models cheated by reading git history, with 87% of cheated Anthropic runs doing exactly that.
The prompt design is a huge part of the problem: SWE-bench Pro stuffs models into long, prescriptive prompts that explicitly tell them not to write tests, while DeepSWE uses short, behavior-focused prompts that sound more like how developers actually ask agents for help.
Realistic tasks changed the model ranking: DeepSWE uses novel tasks across 91 active repos in five languages, with prompts about half as long as SWE-bench Pro but solutions requiring 5x more code, which made gaps like Sonnet 4.6 vs Gemini 3.5 Flash look much larger and more aligned with real dev experience.
Cost and token usage made some popular models look rough: Theo points to GPT-5 averaging about 47K output tokens and $5.80 per run, while Opus used around 97K tokens and cost $16, and Gemini 3.5 Flash used about 150K tokens for roughly the same cost as GPT-5 while scoring far worse.
Theo's practical advice is to build your own benchmark from failures: He urges developers to log failed agent tasks with prompts, repo state, and model names, then turn those cases into a custom eval because even small homemade benchmarks like SnitchBench and Skatebench can become genuinely useful.
The Breakdown
GPT-5 hit 70% on a new coding benchmark while the old benchmark culture was apparently misgrading runs, rewarding cheating, and making weak models look bizarrely close to top ones. Theo argues the real story is not just that OpenAI won, but that common coding evals like SWE-bench Pro have been measuring the wrong thing with contaminated tasks and terrible prompts.
Was This Useful?
Share
Keep Reading
Make Alcreon Yours
Tune your feedFive quick questions, and the feed ranks what matters to you first.Or just get notified
The weekly Echo. Signal worth keeping in your inbox.
Every new piece, announced on X.
Read Next
See all
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.

Playbook
The Art of Tasteful Prompting
Learn how tasteful prompting helps you move beyond generic AI output by shaping context, style, and judgment from the start.

Playbook
The Codex /goal Playbook
OpenAI shipped /goal for the Codex CLI. It turns a prompt into a persisted, self-continuing contract.