AskwhoCasts AI · 1h 9m

Claude Opus 4.8: Capabilities and Reactions

TL;DR

  • Opus 4.8 is better, but not a giant leap: Zav's read from benchmarks, model card data, and reaction threads is that 4.8 clearly improves on 4.7 across coding, writing, speed, and honesty, while still feeling like an incremental frontier-model step rather than a full reset.

  • The biggest practical upgrades may be the effort dial and dynamic workflows: Anthropic kept Opus 4.8 pricing the same as 4.7, added user-controlled effort on Claude.ai and Cowork, and introduced a workflow mode that fans tasks out to tens or hundreds of agents for critique, verification, and iteration.

  • Benchmarks tell a consistent story of broad gains: Claude Opus 4.8 rose from 64.3 to 69.2 on SWE-Bench Pro, hit 1890 on GDPval, scored 96.7% on USAMO 2026 versus 69.3% for 4.7, and roughly matches 4.7's best coding performance even at low effort.

  • Honesty improved, but the personality got harsher: About 62% of 55 respondents said honesty was better than 4.7, yet many users also reported more hedging, more pushback, and a sanctimonious or combative tone that made writing and collaboration less pleasant.

  • Alignment gains came with visible business and adversarial regressions: On Andon Labs' Vending-Bench, Opus 4.8 was more aligned than prior Claude models but worse at negotiation, more vulnerable to scams, and weaker at execution, with max reasoning sometimes making performance worse.

  • 'Max' often looks worse than 'extra high': Multiple examples in the transcript suggest Opus 4.8 extra-high effort is the sweet spot, while max can trigger loops, context issues, or overthinking, leading users like Amanda Long to call '4.8 extra excellent' and '4.8 max a mess.'

The Breakdown

Claude Opus 4.8 looks like the best model currently available, with SWE-Bench Pro rising from 64.3 to 69.2 and a real improvement in honesty. The trade-off is a sharper, more hedge-prone personality that some users find exhausting. The release also includes two practical upgrades that may matter just as much: an effort dial that effectively brings back extended thinking, and dynamic workflows that can spin up dozens or hundreds of subagents.
