Every · May 28, 2026 · 10m

Why Opus 4.8 Pulled Me Back to Claude

TL;DR

Opus 4.8 feels like an Opus 5-level release: the reviewer says Anthropic is "back" after a disappointing Opus 4.7, calling 4.8 a legitimately great model and an S-tier, paradigm-shifting upgrade.
It scored 63 on Every’s senior engineer benchmark: about 30 points above Opus 4.7 and one point ahead of GPT-5.5, putting it near the top of the pack for senior-engineer-style coding tasks.
Writing is where it really stands out: Opus 4.8 scored 79.6 on Every’s writing benchmark versus GPT-5.5’s 73, with stronger voice imitation, fewer AI tells, and notably better results at high or extra-high reasoning.
Reasoning settings matter a lot: the team found that extra-high reasoning materially improved both coding and writing, while medium and even high could feel noticeably worse on difficult work.
The model beats the app right now: the reviewer still lives in Codex because its desktop app is cleaner and faster, arguing that "the harness matters as much as the model does" in this phase of AI products.
It’s unusually strong at knowledge work and interpersonal thinking: beyond code and prose, it made a surprisingly deep slide deck on compound engineering and earned praise for emotionally intelligent, frame-expanding responses to management and relationship questions.

The Breakdown

Opus 4.8 jumped roughly 30 points on Every’s senior engineer benchmark, beat GPT-5.5 by a hair, and was good enough to pull a lapsed Claude power user back into the app. The catch: the model feels like a major leap, but Anthropic’s clunky Claude desktop experience still keeps Codex in the daily-driver seat.