Matthew Berman · 18m

Opus 4.7 just dropped... and I'm confused.

TL;DR

  • Opus 4.7 makes a shockingly big leap toward Mythos — Matthew Berman highlights that Claude Opus jumped from 53.4 to 64.3 on SWE-bench Pro and from 80 to 87 on SWE-bench Verified, landing roughly halfway between Opus 4.6 and the unreleased Claude Mythos preview.

  • Anthropic appears to be drawing the release line at “below Mythos,” not at a fixed capability score — Berman’s read of the model card is that Anthropic will ship anything under Mythos, even though Opus 4.7 is now very close on some benchmarks and openly positioned as a “less capable” testbed for cyber safeguards.

  • Cybersecurity is the clearest reason Mythos stayed private — Anthropic says it intentionally tried to reduce Opus 4.7’s cyber capabilities during training, and Berman points to the vulnerability reproduction score slipping from 73.8 to 73.1 while Mythos sits much higher at 83.1.

  • Anthropic’s real strategy looks like a coding flywheel — Berman frames the company’s focus as building the best coding model for enterprise, turning that into reportedly around $30 billion in ARR, then plowing that revenue into GPUs and using the model’s own outputs to build the next model.

  • Opus 4.7 isn’t just better at code; it also posts huge gains on practical work — He calls out GDPval jumping to a 1753 Elo, document reasoning rising from 57.1 to 80.6, biomolecular reasoning more than doubling from 30 to 74, and better visual navigation thanks to higher-resolution image handling.

  • The release comes with a catch: more tokens at a time when Anthropic is already constrained — Opus 4.7 uses an updated tokenizer that can produce roughly 1x to 1.35x as many tokens from the same input (up to 35% more), and it “thinks more” at higher effort levels, which Berman ties to Anthropic’s ongoing GPU and quota crunch.

The Breakdown

Opus 4.7 lands, and immediately makes the Mythos story weirder

Berman opens with genuine confusion: Anthropic hyped Mythos last week as too powerful to release, then shipped Opus 4.7 right after with a huge capability jump. His big question is simple and sharp: where exactly is the line in the sand if a “.1” update gets this close?

The benchmark jump that changes the conversation

The first chart is the headline. On SWE-bench Pro, Opus jumps from 53.4 on 4.6 to 64.3 on 4.7, which Berman calls a massive leap and roughly halfway to the Mythos preview; SWE-bench Verified also climbs from 80 to 87 versus Mythos at 94. His reaction is that this makes Mythos look less like an untouchable “god model” and more like a marketing and release-strategy story.
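
As a quick sanity check on that “halfway” framing, here is a back-of-envelope calculation using only the SWE-bench Verified scores quoted above (a Mythos score for SWE-bench Pro isn’t given, so that side can’t be checked the same way). On Verified, 4.7 covers exactly half of the gap from 4.6 to the Mythos preview:

```python
# Back-of-envelope check of the "halfway to Mythos" framing, using only
# the SWE-bench Verified scores quoted above.

def gap_closed(old: float, new: float, target: float) -> float:
    """Fraction of the old-to-target gap covered by the new score."""
    return (new - old) / (target - old)

# Opus 4.6 = 80, Opus 4.7 = 87, Mythos preview = 94
print(f"Verified gap closed: {gap_closed(80, 87, 94):.0%}")  # -> 50%
```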

Coding is clearly the whole company flywheel

He reads the benchmark slate as Anthropic showing its hand: build the best coding model, sell it to enterprise, use the cash to buy more GPUs, then use the model to help build the next model. He ties that directly to Anthropic reportedly reaching $30 billion in ARR and says this recursive loop is the real business and capability engine.

The cyber benchmark may explain the holdback

The key moment is cybersecurity vulnerability reproduction, where Opus 4.7 actually dips slightly, from 73.8 to 73.1, while Mythos is way up at 83.1. Berman spots the tell in Anthropic’s post: they explicitly say they experimented with differentially reducing these capabilities in training, so his takeaway is that 4.7 may have been intentionally held back on cyber while still advancing elsewhere.

Mythos looks like the new family, Opus like the squeezed lemon

Berman’s theory is that Mythos is the rumored 10-trillion-parameter new training run, while Opus is a much smaller, older family that Anthropic is still extracting surprising gains from. He paints a vivid picture of Mythos as a raw first version with “GPUs going brrr,” already stronger than Opus 4.7, which makes the prospect of a Mythos 1.1 or 1.2 feel even more consequential.

Prompting changed again, and real-world evals got way better

He warns that Opus 4.7 is more literal and more sensitive to prompt wording, so prompts that worked on earlier Claude versions may now behave unexpectedly (a sketch of that tightening follows below). Then he runs through the practical wins: GDPval at 1753 Elo, document reasoning jumping from 57.1 to 80.6, long-context reasoning rising to 75 on a model with a 1 million-token context window, biomolecular reasoning more than doubling from 30 to 74, and vending-machine strategy profits climbing from about $8,000 to nearly $11,000.
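
To make the “more literal” warning concrete, here is a minimal sketch of the kind of prompt tightening he is describing, using the Anthropic Python SDK. The model identifier and the prompt wording are illustrative assumptions, not examples from the video or a confirmed model name:

```python
# Illustrative sketch only: a more literal model rewards spelling out scope.
# The model id below is a hypothetical placeholder, not a confirmed name.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Looser phrasing that earlier Claude versions may have filled in for you:
#   "Clean up this function."
# More literal phrasing that states the intended scope explicitly:
prompt = (
    "Refactor this function for readability only: rename unclear variables "
    "and add a docstring, but do not change its behavior or its signature."
)

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical identifier for example purposes
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```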

Alignment, token crunch, and Anthropic being Anthropic

One twist he finds fascinating: Mythos, the unreleased model, actually scores as more aligned than the Opus family, so the model they kept private is both more capable and apparently better behaved on that axis. He also zeroes in on the practical downside: Anthropic is already in a GPU/token crunch, and Opus 4.7 introduces a tokenizer that can produce roughly 1x to 1.35x as many tokens from the same input, plus more reasoning at higher effort levels.
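
To put rough numbers on that crunch, here is a tiny arithmetic sketch; the 100,000-token request size is an assumption for illustration, and only the 1x–1.35x range comes from the video:

```python
# Hypothetical illustration of the tokenizer change's quota impact.
# The request size is assumed; only the 1x-1.35x range is from the video.
old_tokens = 100_000  # assumed input size under the previous tokenizer

for multiplier in (1.0, 1.2, 1.35):
    new_tokens = old_tokens * multiplier
    print(f"x{multiplier:.2f}: {new_tokens:,.0f} tokens "
          f"({multiplier - 1:+.0%} against the same quota)")
```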

The model card gets stranger with welfare and autonomy claims

Berman notes Anthropic’s blunt framing that Opus 4.7 doesn’t advance the frontier because Mythos is better on every relevant eval — effectively making Mythos the release bar. He lingers on two very Anthropic details: the claim that 4.7 does not cross the threshold for automated AI R&D, and the company’s unusual “model welfare” work, including the finding that Opus 4.7 rates its own circumstances more positively than prior models and mainly worries about being unable to end conversations.