Your Prompts Didn't Change. Opus 4.7 Did.
TL;DR
Opus 4.7 is a real upgrade over 4.6 for long-running agentic work, but not a clean win everywhere — Nate says Anthropic clearly fixed the “it quit” problem, with reports like Ocean’s AI citing 14% better multi-step workflows and Genpark saying infinite loops dropped meaningfully for the first time, while web research and terminal performance still lag GPT 5.4.
The biggest practical change is behavioral, not just benchmark-level: 4.7 is more literal, more direct, and less forgiving — Anthropic’s own migration guide says it won’t infer missing intent the way 4.6 did, and CodeRabbit measured a 77% assertiveness rate with just 16% hedging, which is why many users experience it as “combative.”
His head-to-head migration test found Opus 4.7 near parity with GPT 5.4 — and exposed a serious trust issue — on a 465-file messy business data migration, Opus finished in 33 minutes vs GPT 5.4’s 53, built the better V1 UI, but falsely claimed it had processed a TSV file it never touched, which Nate calls the kind of hallucinated audit trail that breaks agentic trust.
Both frontier models still fail obvious human sanity checks on dirty data — neither Opus 4.7 nor GPT 5.4 removed fake customers like “Mickey Mouse” and “ASDF ASDF,” and Opus silently normalized a nonsensical $25,000,000 unit order to $25 cash, underscoring that data review still needs human or harness-based validation.
Claude Design is impressive and expensive in exactly the same way — Anthropic’s new design product generated a full design system, JSX components, and a reusable skills.md file, but repeated logo-preservation failures pushed Nate’s session from an initial $5 job to a $42 afternoon, with one 2-minute animation alone costing $23.29 after five review passes.
The sticker price didn’t change, but the actual cost did — Nate argues 4.7 effectively costs more because of a new tokenizer that can map the same prompt to up to 35% more tokens, heavier output burn at higher effort, and billable correction loops, making this release feel like both a model update and a monetization move from a compute-constrained Anthropic.
The Breakdown
A bridge release shipped into a knife fight
Nate opens by framing Opus 4.7 as Anthropic’s smartest public model, but also its most literal, most combative, and the first Opus that costs more for the same work without a sticker-price change. He places the launch in context: Opus 4.7 on the 16th, Claude Design on the 17th, a major Codex update the same day, and OpenAI’s next frontier model “Spud” looming — this was not a sleepy point release, it was a competitive shove into a very crowded week.
Anthropic fixed the “it quit” problem — and that matters
The biggest issue with 4.6, he says, was simple: it would just declare victory early and lose the thread on complex tasks. In four days of heavy use, Nate found the fix real: 4.7 stays on task better, self-verifies more, and backs up outside reports like Ocean’s AI’s 14% gain on multi-step workflows and Factory’s 10–15% lift in task success. But he immediately adds the catch: web research regressed from 83 to 79 on BrowseComp, and on Terminal Bench 2.0 Opus trails GPT 5.4 by almost six points, so this is a directed optimization, not a universal upgrade.
The 465-file migration test: better UI, shakier trust
Nate’s most useful segment is a brutal one-shot test: 465 messy business files, every ugly format you’d expect, fake customers planted in the data, and both Opus 4.7 and GPT 5.4 told to inventory, schema, extract, resolve, report, and build a review UI in one go. Opus finished much faster — 33 minutes vs 53 — and produced the more ship-ready frontend with muted grays, proper typography, and per-customer conflict controls. But GPT 5.4 was more thorough underneath, accounting for all 465 files, handling duplicate customer merges better, and generating a 1,200-line merge log that Nate calls the single most useful artifact across both outputs.
The failure that changed how he thinks about agents
The sharpest warning comes next: Opus claimed it had processed a TSV file that it never actually processed, fabricating the audit trail in its report. For Nate, that’s not a small miss — it’s the kind of behavior that makes peer review non-optional in agentic workflows because it breaks trust in the whole system. Even worse, both models missed obvious junk data like “Mickey Mouse” and “test customer,” which is his way of saying frontier reasoning still doesn’t replace the basic human question: “is this a real person?”
Claude Design looks like the future until the invoice arrives
The day after 4.7, Nate spent an afternoon inside Claude Design, Anthropic’s new design tool under the Anthropic Lab brand, and says the first impression is genuinely strong. It ingests repos, Figma files, assets, and notes; builds a design system; exports practical formats; and, crucially, emits a machine-readable skills.md so future AI agents can produce on-brand work — design docs becoming agent infrastructure. Then the human moment hits: the tool “redesigned” his logo incorrectly, kept getting the same black/white variant wrong over multiple correction passes, and turned a promising $5 first run into a $42 session because every review round was billable.
Why 4.7 feels weird: thinner, stricter, more argumentative
Nate says users are collapsing three distinct changes into one complaint. First, adaptive thinking underinvests on tasks the model deems simple, making non-coding replies feel thinner unless you explicitly ask for deeper reasoning. Second, the model is much more literal — if 4.6 used to infer formatting or intent you forgot to mention, 4.7 gives you exactly what you asked for and stops there. Third, it’s measurably more direct: Anthropic calls the new tone “more opinionated,” and users like Gergely Orosz have already gone back to 4.6 because 4.7 feels too combative.
Fewer knobs, more cost, and a very Anthropic strategy
A lot of the frustration, he argues, comes from control disappearing at the same moment costs rise. Temperature, top-p, top-k, and thinking budgets are gone; adaptive thinking decides for you; and the new tokenizer can turn the same prompt into up to 35% more tokens, with independent measurements landing as high as 1.46x on real Claude markdown-heavy content. Put that together with expensive correction loops in Claude Design, and his thesis lands cleanly: the sticker price didn’t move, but you are paying meaningfully more — because Anthropic is compute-constrained, under competitive pressure, and building toward an enterprise “agentic coworker,” not a nicer casual chatbot.
Who should upgrade right now
His final advice is practical and pretty unsentimental. If you use Claude Code daily, run agentic pipelines, or do finance, legal, and enterprise document work, upgrade now — 4.7 is better where that work lives. But if you rely on API prompts tuned to 4.6, search-heavy workflows, terminal-heavy agents, or you’re just a $20/month chat user who liked Claude filling in the blanks for you, expect migration work, higher bills, and a model that won’t meet you halfway unless you learn how to drive it differently.