Finally a good benchmark (DeepSWE)
TL;DR
DeepSWE looks more like how developers actually use coding agents: prompts are short and behavior-focused rather than over-specified, so models have to explore the repo and figure out the fix instead of following a giant hint-filled brief.
GPT-5.5 posts a clear win over Opus 4.7 here: on the DeepSWE leaderboard, GPT-5.5 Extra High hits about 70% while Opus 4.7 trails by more than 15 points, matching what Matthew Berman says he’s hearing from engineers.
The verifier quality is a huge deal: DeepSWE reports a 0.3% false-positive rate and a 1.1% false-negative rate versus SWE-bench Pro’s 8.5% and 24% respectively, which means the benchmark is much less likely to misgrade real solutions.
Cost and token efficiency make the gap look worse for Anthropic: Berman highlights GPT-5.5 at roughly $5.80 per trial versus Opus 4.7 near $16, with median output tokens around 16,000 for GPT-5.5 versus 60,000 for Opus 4.7 in the MiniSuite setup.
DeepSWE creates more meaningful model separation: instead of bunching everyone together, the benchmark spreads models out from GPT-5.5 near the top to Claude Haiku 4.5 at 0%, making it easier to tell who’s actually better.
The benchmark also surfaces behavioral differences between model families: Claude often misses one branch of multi-part requirements, while GPT-5.5 is described as more literal and consistent about implementing all stated behaviors.
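To get a feel for why the verifier error rates in the TL;DR matter, here is a rough Python sketch of how a noisy verifier distorts a reported score. It rests on assumptions the source does not state: that the false-positive rate is the share of incorrect solutions graded as passing, that the false-negative rate is the share of correct solutions graded as failing, and that a model truly solves 70% of tasks (a hypothetical number picked only for illustration). The function name `observed_pass_rate` is made up for this example.

```python
def observed_pass_rate(true_solve_rate: float, fpr: float, fnr: float) -> float:
    """Pass rate a noisy verifier would report for a given true solve rate.

    Assumption (not from the source): fpr is the share of incorrect solutions
    graded as passing, fnr is the share of correct solutions graded as failing.
    """
    correct_and_passed = true_solve_rate * (1.0 - fnr)
    wrong_but_passed = (1.0 - true_solve_rate) * fpr
    return correct_and_passed + wrong_but_passed


# Verifier error rates quoted in the summary.
VERIFIERS = {
    "DeepSWE": {"fpr": 0.003, "fnr": 0.011},
    "SWE-bench Pro": {"fpr": 0.085, "fnr": 0.24},
}

# Hypothetical true solve rate, chosen only for illustration.
TRUE_SOLVE_RATE = 0.70

for name, rates in VERIFIERS.items():
    reported = observed_pass_rate(TRUE_SOLVE_RATE, rates["fpr"], rates["fnr"])
    print(f"{name}: true {TRUE_SOLVE_RATE:.0%}, reported {reported:.1%}")
```

Under these assumptions, the DeepSWE-style verifier would report a score within about a point of the true 70%, while the SWE-bench Pro-style verifier would report roughly 56%, mostly because correct solutions get marked as failures.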
The Breakdown
GPT-5.5 doesn’t just edge out Claude Opus 4.7 on DeepSWE — it beats it by 15+ points while using far fewer tokens, less time, and about one-third the cost. Matthew Berman argues the bigger story is the benchmark itself: shorter, more realistic prompts, contamination-free tasks, and a verifier that cuts false negatives from 24% on SWE-bench Pro to just 1.1%.
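The "one-third the cost" and "far fewer tokens" claims can be sanity-checked against the figures quoted in the summary. The snippet below is a quick back-of-the-envelope check in Python, not anything published with the benchmark; the inputs are the approximate numbers Berman cites, and the variable names are illustrative.

```python
# Approximate figures quoted in the summary (as reported by Berman).
gpt_cost_usd, opus_cost_usd = 5.80, 16.00    # cost per trial
gpt_tokens, opus_tokens = 16_000, 60_000     # median output tokens
fnr_swebench_pro, fnr_deepswe = 0.24, 0.011  # verifier false-negative rates

# GPT-5.5 relative to Opus 4.7 (values below 1 mean GPT-5.5 uses less).
cost_ratio = gpt_cost_usd / opus_cost_usd
token_ratio = gpt_tokens / opus_tokens

# How many times lower DeepSWE's false-negative rate is than SWE-bench Pro's.
fn_reduction = fnr_swebench_pro / fnr_deepswe

print(f"Cost per trial: GPT-5.5 is {cost_ratio:.2f}x of Opus 4.7")
print(f"Median output tokens: GPT-5.5 is {token_ratio:.2f}x of Opus 4.7")
print(f"False-negative rate: about {fn_reduction:.0f}x lower on DeepSWE")
```

The cost ratio comes out a little above one-third and the token ratio a little above one-quarter, so the rounded claims in the summary hold up against the quoted numbers.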