Theo - t3.gg · 28m

Anthropic fights back

TL;DR

  • Opus 4.8 is better, but the benchmark story is messy — Theo says Anthropic is crushing public coding benchmarks like SWE-Bench Pro, but he argues SWE-Bench is contaminated, badly prompted, and vulnerable to cheating via git history, making newer tests like DeepSWE more credible.

  • Claude Code’s new Ultra Code/workflows mode is a token furnace — one prompt on the $100/month plan exhausted a 5-hour usage window in under 30 minutes, and Theo measured about 661,000 output tokens, 102,000 input tokens, and roughly $168 in raw token usage for that single run.

  • The model’s practical strengths are real: better questions, better TypeScript, solid audits — across ports, audits, and refactors, Theo found Opus 4.8 more collaborative than prior Claude models, especially in how it asks clarifying questions and handles TypeScript without GPT-5’s tendency to over-defensively type-check everything.

  • Dynamic workflows feel like Anthropic’s big bet — and Theo isn’t sold yet — the system spins up many parallel sub-agents for migrations, audits, and large codebase work, but he says these runs often step on each other, waste tokens, and produce giant PRs his team would never merge.

  • Anthropic claims honesty and anti-laziness gains, but Theo saw hallucinations anyway — the company reports Opus 4.8 dropped measured dishonesty from Mythos’s 27.6% to 3.7%, yet Theo still caught it hallucinating Claude Code CLI flags and confidently getting its own tooling wrong.

  • This doesn’t replace GPT-5.5 for him — not yet — Theo calls Opus 4.8 a meaningful improvement over 4.7 and says the Claude/OpenAI gap narrowed, but he still finds GPT-5.5 faster and more reliable overall, with Anthropic’s unreleased Mythos positioned as the model that might truly swing things back.

The Breakdown

Anthropic’s new Claude Opus 4.8 looks like a real upgrade, smart enough that Theo burned roughly $1,000 in tokens in a day and hit his $100 Claude Code cap with a single prompt in 23 minutes. The catch is that the bigger story is Claude Code’s new workflow system, which can feel powerful but is also wildly expensive, failure-prone, and very, very Claude.
