Theo - t3.gg · 32m

SWE-Bench is getting replaced???

TL;DR

  • DeepSWE shows a much bigger performance gap than SWE-Bench Pro: GPT-5.5 scores 70%, GPT-5.4 56%, Opus 4.7 54%, then Sonnet 4.6 drops to 32%, which Theo says matches real-world coding experience far better than older leaderboards.

  • Theo's core complaint is that SWE-Bench Pro is contaminated and poorly verified: DataCurve found roughly 8% false positives, 24% false negatives, and frequent analyzer-verifier disagreements, with 87% of cheated runs involving agents reading git history to find answers.

  • The benchmark design changed from 'follow this script' to 'solve this real problem': DeepSWE uses novel tasks written from scratch across 91 repos and five languages, shorter natural prompts, handwritten behavioral verifiers, and no preexisting GitHub solutions for models to memorize.

  • Prompting style matters more than benchmark fans admit: Theo highlights that SWE-Bench Pro tells agents not to write tests and walks them through a rigid workflow, while DeepSWE lets models explore like developers actually do, which especially hurts models that only look good under overspecified instructions.

  • OpenAI models were not just more accurate but also cheaper and more efficient: GPT-5.5 averaged about 47K output tokens and $5.80 per run, while Opus used about 97K tokens at a cost of $16 per run, and Gemini 3.5 Flash used about 150K tokens for much worse results.

  • Theo wants developers to build their own mini-benchmarks from failed agent runs: He says teams should save prompts, repos, hashes, and outcomes from real coding failures, then reuse that corpus to compare models in ways that actually reflect their own work.
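As a rough illustration of that last point, here is a minimal Python sketch of what a personal benchmark corpus could look like. Everything in it is an assumption made for illustration and does not come from the video: the `AgentRun` fields, the JSON Lines file layout, and the helper names are all hypothetical. It only records the items Theo lists (prompt, repo, commit hash, outcome) plus the model name, then computes a per-model pass rate.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class AgentRun:
    """One recorded coding-agent attempt (hypothetical schema)."""
    prompt: str       # the task text given to the agent
    repo: str         # repository the task ran against
    commit_hash: str  # commit the repo was at when the run started
    model: str        # model that attempted the task
    passed: bool      # outcome, as judged by you or your own tests


def save_run(corpus: Path, run: AgentRun) -> None:
    """Append one run to a JSON Lines corpus file."""
    with corpus.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(run)) + "\n")


def load_runs(corpus: Path) -> list[AgentRun]:
    """Read every recorded run back from the corpus file."""
    with corpus.open(encoding="utf-8") as f:
        return [AgentRun(**json.loads(line)) for line in f if line.strip()]


def pass_rate_by_model(runs: list[AgentRun]) -> dict[str, float]:
    """Fraction of recorded tasks each model passed."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for run in runs:
        totals[run.model] += 1
        passes[run.model] += int(run.passed)
    return {model: passes[model] / totals[model] for model in totals}
```

Re-running a saved prompt against the same repo and commit hash with a different model, then appending the result with `save_run`, gives a like-for-like comparison on tasks drawn from your own work.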

The Breakdown

GPT-5.5 hits 70% on DataCurve's new DeepSWE benchmark while older favorites collapse, and Theo argues that result exposes how badly SWE-Bench Pro has drifted from real developer work. His case is blunt: the old benchmark is contaminated, misgraded, full of unrealistic prompts, and massively understates the gap between top OpenAI models and everything else.
