This model is kind of a disaster.
TL;DR
Opus 4.7 impressed early, then fell apart fast — Theo says Anthropic’s new public model had a strong first session but then repeatedly regressed over ~12 hours, calling it “one of the weirdest models ever released.”
The biggest issue wasn’t raw capability, but broken product scaffolding — his core argument is that Claude Code’s harness, system prompts, permissions, and safety layers are degrading the experience more than the underlying model itself.
Anthropic’s cyber-safety controls are blocking obviously benign work — Opus 4.7 hard-paused on a Defcon Gold Bug cryptography puzzle and leaked malware-related system prompt text while reviewing Theo’s own site, which made the model feel “lobotomized.”
Instruction-following got better, but practical reasoning got worse — even after being told to upgrade dependencies to the latest versions, Opus repeatedly picked Next.js 15 instead of 16 because it didn’t verify current versions on the web.
GPT-5.4 looked more reliable on the exact same coding tasks — when Theo reran modernization and research workflows with OpenAI’s model, it explicitly checked for current stable versions, found Next.js 16, and avoided the same overconfident mistakes.
Theo thinks Anthropic’s internal/external tooling gap is making public releases look worse — his theory is that Anthropic employees use a stronger internal stack, while customers get a buggy, overconstrained Claude Code experience that masks the model’s actual strengths.
The Breakdown
Theo’s opening verdict: exciting release, deeply weird model
Theo comes in half-hyped, half-annoyed: Opus 4.7 is useful, but not the best new model, and not even Anthropic’s best overall model — just the best one the public can actually use. His framing lands hard: this model “gets dumber the more you do it,” and he says he watched it regress in real time over the course of the day.
Anthropic’s own launch story already hints at tradeoffs
He walks through Anthropic’s claims: better software engineering, stronger instruction following, better vision up to 2576 pixels on the long edge, and the same pricing as Opus 4.6 at $5 per million input tokens and $25 per million output. But he immediately points out the benchmark caveat: Anthropic says it wins on “a range of benchmarks,” not all of them, and it actually loses to Opus 4.6 in some areas, such as Agentic Search.
Safety leaks into normal use and makes the model feel broken
The first real test goes sideways fast in Claude Code Desktop. When Theo asks for design improvements for T3.gg, Opus starts talking about malware-related system reminders and claims it’s ignoring prompt injections — even though he didn’t configure any of that himself. He says Anthropic is trying so hard to suppress malicious cyber use that it has accidentally poisoned the normal UX, and the required “cyber verification program” for legitimate security work only makes the whole thing feel sillier.
The Defcon puzzle test is where Theo really loses patience
He gives Opus a Gold Bug cryptography puzzle from Defcon — not hacking, just weird bottle labels and a pirate-themed decoding challenge that had previously taken his team multiple days. Opus appears to make real progress, trying ciphers and writing code, then the chat gets hard-stopped by safety filters and he’s told to continue with Sonnet 4 instead. That’s the breaking point for him: he’s paying $200 a month and the model won’t even finish a harmless puzzle.
Modernizing an old Next.js app shows the model’s most frustrating failure mode
In Claude Code CLI, things start promisingly: Opus writes a concise, clean migration plan for Theo’s neglected Ping.gg codebase, still on Next.js 12 and React 17. But because Theo trusts it and doesn’t scrutinize the plan, the model gets away with Next.js 15 instead of 16 — despite being explicitly told to bump everything to the latest versions — because it never verifies what “latest” actually means.
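That failure mode is cheap to guard against: compare the version the model picked against what the registry actually reports before accepting the plan. A minimal sketch — the two version values are hardcoded here for illustration; in a real check the second would come from something like `npm view next dist-tags.latest`:

```shell
# Hypothetical pre-flight check for a dependency-upgrade plan.
picked="15.0.0"   # version the model proposed (illustrative)
latest="16.1.0"   # what the registry reports as latest (illustrative)

# sort -V orders version strings numerically; the last line is the newest.
newest=$(printf '%s\n%s\n' "$picked" "$latest" | sort -V | tail -n1)

if [ "$newest" != "$picked" ]; then
  echo "stale: plan targets $picked but latest is $latest"
fi
```

The point isn’t the five lines of shell — it’s that “upgrade to latest” is only verifiable against a live source of truth, which is exactly the step Opus skipped and GPT-5.4 took.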
The clone-script saga turns one miss into a pattern
Theo then has Opus write a shell script to clone a repo into a hidden quick-clones directory, carry over env files, and switch back to main. The script keeps missing explicit requirements, dragging over untracked files, forgetting branch switching, and hallucinating why env files weren’t copied, even when Theo shows evidence that contradicts the explanation. This is the moment he says he went from “I kind of like using this” to “I literally can’t use this model for any of the things I do.”
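For reference, the spec Theo describes is small enough to fit in one function. This is a hedged sketch of what such a script should do, not the script from the video — the `.quick-clones` path, the env-file names, and the `quick_clone` name are all illustrative assumptions:

```shell
# Sketch: clone a working copy into a hidden quick-clones directory,
# carry over env files, and leave the clone on main.
quick_clone() {
  src="$1"                          # path to the existing working copy
  name=$(basename "$src")
  dest="$HOME/.quick-clones/$name"  # hidden clones directory (assumed path)

  mkdir -p "$HOME/.quick-clones"
  git clone -q "$src" "$dest"       # copies tracked content only

  # Env files are untracked, so the clone skips them; copy them explicitly.
  for f in "$src/.env" "$src/.env.local"; do
    [ -f "$f" ] && cp "$f" "$dest/"
  done

  git -C "$dest" checkout -q main   # end on main, per the requirements
}
```

Note the comment on the env loop: `git clone` only carries tracked files, which is precisely the detail Opus kept hallucinating explanations for instead of stating plainly.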
His real thesis: the harness is rotting, not necessarily the model
This is the heart of the video. Theo says he doesn’t actually buy the popular claim that Anthropic’s models are simply getting dumber over time; instead, he thinks Claude Code keeps getting worse through bad prompts, tool rules, safety sludge, and brittle permissions behavior. His metaphor is great: if you give a talented carpenter plastic tools and fill the toolbox with mud, the output will look worse — not because the carpenter forgot how to build, but because the setup is broken.
OpenAI comparison, final verdict, and one last self-own
He reruns similar tasks with Cursor and GPT-5.4, and while 5.4 isn’t perfect, it at least admits its training data may be stale, checks the web, and finds Next.js 16. That contrast reinforces his point that OpenAI seems more transparent about harness issues, while Anthropic ships chaos and lets users absorb the cost. He closes by calling Opus 4.7 brilliant in flashes but wildly inconsistent — like Jake’s quote about a model that can touch 30 files correctly and still get one boolean backwards — then accidentally yanks out his own XLR cable while filming the outro, joking that Opus must be rubbing off on him.