Theo - t3.gg · May 29, 2026 · 28m

Anthropic fights back

TL;DR

Opus 4.8 is better, but the benchmark story is messy. Theo acknowledges that Anthropic is crushing public coding benchmarks like SWE-Bench Pro, but argues that SWE-Bench is contaminated, badly prompted, and vulnerable to cheating via git history, which makes newer tests like DeepSWE more credible.
Claude Code’s new Ultra Code/workflows mode is a token furnace. One prompt on the $100/month plan exhausted a 5-hour usage window in under 30 minutes, and Theo measured about 661,000 output tokens, 102,000 input tokens, and roughly $168 in raw token usage for that single run.
The model’s practical strengths are real: better questions, better TypeScript, solid audits. Across ports, audits, and refactors, Theo found Opus 4.8 more collaborative than prior Claude models, especially in how it asks clarifying questions and handles TypeScript without GPT-5’s tendency to type-check everything over-defensively.
Dynamic workflows feel like Anthropic’s big bet, and Theo isn’t sold yet. The system spins up many parallel sub-agents for migrations, audits, and large codebase work, but he says these runs often step on each other, waste tokens, and produce giant PRs his team would never merge.
Anthropic claims honesty and anti-laziness gains, but Theo saw hallucinations anyway. The company reports Opus 4.8 dropped measured dishonesty from Mythos’s 27.6% to 3.7%, yet Theo still caught it hallucinating Claude Code CLI flags and confidently getting its own tooling wrong.
This doesn’t replace GPT-5.5 for him, at least not yet. Theo calls Opus 4.8 a meaningful improvement over 4.7 and says the Claude/OpenAI gap has narrowed, but he still finds GPT-5.5 faster and more reliable overall, with Anthropic’s unreleased Mythos positioned as the model that might truly swing things back.
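To put the token-furnace numbers in proportion, here is a minimal TypeScript sketch that derives a few ratios from the figures quoted in the summary. The names `RunUsage` and `summarize` are hypothetical and exist only for this illustration. The 30-minute figure is the summary's upper bound ("under 30 minutes"), so the window fraction is a ceiling rather than a measurement. No per-token prices are assumed, because the summary does not state them.

```typescript
// Figures as reported in the summary; all are approximate.
interface RunUsage {
  outputTokens: number;
  inputTokens: number;
  rawCostUsd: number;       // raw token usage for the single run
  planPriceUsd: number;     // monthly price of the plan used
  windowMinutes: number;    // length of the usage window
  minutesToExhaust: number; // upper bound: "under 30 minutes"
}

const run: RunUsage = {
  outputTokens: 661_000,
  inputTokens: 102_000,
  rawCostUsd: 168,
  planPriceUsd: 100,
  windowMinutes: 5 * 60,
  minutesToExhaust: 30,
};

// Derive simple ratios; nothing here depends on per-token pricing.
function summarize(u: RunUsage) {
  return {
    totalTokens: u.outputTokens + u.inputTokens,
    outputPerInput: u.outputTokens / u.inputTokens,
    costVsMonthlyPlan: u.rawCostUsd / u.planPriceUsd,
    windowFractionUsed: u.minutesToExhaust / u.windowMinutes,
  };
}

const summary = summarize(run);
```

On these figures, the run produced about 763,000 tokens in total, roughly 6.5 output tokens for every input token, at a raw cost of about 1.68 times the plan's monthly price, and it used up the five-hour window in at most a tenth of its length.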

The Breakdown

Anthropic’s new Claude Opus 4.8 looks like a real upgrade, compelling enough that Theo burned roughly $1,000 in tokens in a day and hit the usage cap on his $100/month Claude Code plan with a single prompt in 23 minutes. The catch: the bigger story is Claude Code’s new workflow system, which can feel powerful but also wildly expensive, failure-prone, and very, very Claude.