Opus 4.8 Scored 81. Your Workflow Doesn't Care.
TL;DR
Opus 4.8 looks more like a checkpoint than Anthropic's big swing: Nate says the May 28 release was timed to support a major funding announcement, while the real anticipated model is Mythos, the much-teased large release Anthropic still has not shipped.
Higher reasoning does not reliably help on 4.8: Nate points to Vending Bench, where Opus 4.8 regressed versus 4.7 and where 4.8 on high beat 4.8 on max, breaking the usual expectation that more reasoning effort produces better results.
Anthropic's alignment focus may be causing costly overthinking: Nate says reasoning traces from 4.8 max show the model spiraling on constitutional concerns like warmth, tone, and alignment, to the point that it can become less effective on practical tasks.
The harness now matters more than the benchmark score: His daily-driver choice is OpenAI 5.5 in Codex not because 4.8 lacks strengths, but because Codex handles long-running tasks, files, browser actions, and multi-hour execution more reliably.
A real-world website test exposed the gap: In his side-by-side trial, 5.5 built and deployed two Markdown-domain websites, while 4.8 errored out twice and struggled with compute limits, even before iteration on design quality.
Slash workflows is the standout 4.8 innovation: Nate praises Claude Code's new /workflows command for letting the model compose, reveal, and assign multi-agent workflows transparently, and predicts that pattern will spread across AI products this summer.
The Breakdown
Opus 4.8 may be posting top-tier scores like 81, but Nate B Jones argues that does not matter if the model overthinks, behaves inconsistently, and sits inside a weaker harness than OpenAI's 5.5 in Codex. His bigger point is that in 2026 the real contest is no longer just model IQ, but whether the surrounding product can reliably turn that intelligence into finished work.