Anthropic fights back
TL;DR
Opus 4.8 is better, but the benchmark story is messy. Theo says Anthropic is crushing public coding benchmarks like SWE-Bench Pro, yet he argues that SWE-Bench is contaminated, badly prompted, and vulnerable to cheating via git history, which makes newer tests like DeepSWE more credible.
Claude Code’s new Ultra Code/workflows mode is a token furnace. One prompt on the $100/month plan exhausted a 5-hour usage window in under 30 minutes, and for that single run Theo measured about 661,000 output tokens, 102,000 input tokens, and roughly $168 in raw token usage.
The model’s practical strengths are real: better questions, better TypeScript, solid audits. Across ports, audits, and refactors, Theo found Opus 4.8 more collaborative than prior Claude models, especially in how it asks clarifying questions and in how it handles TypeScript without GPT-5’s tendency to type-check everything over-defensively.
Dynamic workflows feel like Anthropic’s big bet, and Theo isn’t sold yet. The system spins up many parallel sub-agents for migrations, audits, and large codebase work, but he says these runs often step on each other, waste tokens, and produce giant PRs his team would never merge.
Anthropic claims honesty and anti-laziness gains, but Theo saw hallucinations anyway. The company reports that Opus 4.8 cut measured dishonesty to 3.7%, down from Mythos’s 27.6%, yet Theo still caught it hallucinating Claude Code CLI flags and confidently getting its own tooling wrong.
This doesn’t replace GPT-5.5 for him, at least not yet. Theo calls Opus 4.8 a meaningful improvement over 4.7 and says the Claude/OpenAI gap has narrowed, but he still finds GPT-5.5 faster and more reliable overall, with Anthropic’s unreleased Mythos positioned as the model that might truly swing things back.
The Breakdown
Anthropic’s new Claude Opus 4.8 looks like a real upgrade, capable enough that Theo burned roughly $1,000 in tokens in a day and hit the usage cap on his $100/month Claude Code plan with a single prompt in 23 minutes. The catch: the bigger story is Claude Code’s new workflow system, which can feel powerful but also wildly expensive, failure-prone, and very, very Claude.
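To put the TL;DR's token figures in perspective, here is a back-of-envelope sketch in Python. It uses only the numbers quoted above (about 661,000 output tokens, 102,000 input tokens, roughly $168 in raw usage, and a $100/month plan). The function name `summarize_run` and its field names are hypothetical and chosen for illustration. The blended per-million-token figure is a derived average, not a published price, and the summary does not say how the $168 was calculated.

```python
def summarize_run(output_tokens: int, input_tokens: int,
                  raw_cost_usd: float, plan_price_usd: float) -> dict:
    """Derive rough ratios from one run's reported token counts and raw cost.

    Hypothetical helper: the inputs are the approximate figures quoted in the
    summary, so the outputs are estimates, not billing data.
    """
    total_tokens = output_tokens + input_tokens
    return {
        # Input plus output tokens reported for the single run.
        "total_tokens": total_tokens,
        # Blended average cost per million tokens (derived, not a list price).
        "usd_per_million_tokens": raw_cost_usd / total_tokens * 1_000_000,
        # Raw usage for one run expressed as a multiple of the monthly plan price.
        "multiple_of_plan_price": raw_cost_usd / plan_price_usd,
    }


# Figures as quoted in the TL;DR; all are approximate.
run = summarize_run(
    output_tokens=661_000,
    input_tokens=102_000,
    raw_cost_usd=168.0,
    plan_price_usd=100.0,
)

for name, value in run.items():
    print(f"{name}: {value:,.2f}")
```

Read this way, the single run's raw token usage works out to more than one and a half times the monthly plan price, which is the sense in which the mode is described as a token furnace.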