Armin Ronacher · 1h 38m

State of Agentic Coding #5 with Armin and Ben

TL;DR

  • Both hosts think the real bottleneck isn’t coding anymore — it’s review, architecture, and cleanup — Ben says he’s back in the editor mostly for diff review, built his own tool called Hunk, and Armin argues the future IDE may be more about reviewing agent output than writing code.

  • Agentic coding is pushing teams to ship faster than their processes can handle — Armin says 7 out of 10 teams he interviewed had non-engineers submitting PRs, one FAANG-like company tracks token burn on internal leaderboards, and he heard examples of people who hadn’t written code in 15 years suddenly shipping again.

  • The security story is getting worse in two directions at once — today’s models already generate buggy code easily, but with the right harness they’re also “shockingly good” at finding vulnerabilities, which Armin says is driving more zero-days, duplicate disclosures, supply-chain attacks, and highly targeted phishing.

  • They think “slop theater” is now a real genre of AI coding — the episode calls out projects that flaunt 25-agent canvases, giant skill files, or 37,000 lines of code a day, arguing the performance of parallelism and token burn often matters more than software quality, refactoring, or architecture.

  • Armin’s sharpest takeaway: AI is better at finding mistakes than preventing them — he says if you use agents to go faster without slowing down for deliberation, they’ll happily create invalid states, excessive fallback logic, and complexity that future models probably won’t be able to untangle.

  • A new kind of tooling discrimination is coming — both predict people will judge PRs by what model or harness produced them, with examples ranging from OpenCode maintainers preferring Codex over Opus to future companies likely standardizing on one approved stack for security, compliance, and review.

The Breakdown

The IDE Isn’t Dead — But It’s Shrinking Into a Review Tool

They open by revisiting an old prediction: are developers still using IDEs? Ben says yes, but mostly for code review, not for writing, and Armin reads Cursor 3’s launch as a sign the center of gravity is shifting toward a ChatGPT-style textbox rather than a traditional editor. The vibe is: the editor isn’t gone, but the thing people miss most is a good diff.

Hunk, Diff Review, and the Search for Better Agent Ergonomics

Ben explains why he built Hunk: he trusts LLMs to describe the mechanical work they did, but not to reason reliably about architecture, so he wants a diff tool where the agent can annotate what changed. Armin adds that review comments now split three ways — feedback for the agent, for your own brain, and for another human reviewer — and tooling hasn’t really caught up. Both see diff review as one of the most fertile product areas in agentic coding right now.

Slop Forks Are Becoming Infrastructure Strategy

From there they move into “slop forks”: reimplementations or compatibility layers like Chardat and Cloudflare’s new WordPress-compatible CMS, Mdash. Armin’s explanation is practical, not snarky — Cloudflare runs V8 isolates, not full Node, so agents are suddenly useful for rewriting software to fit its constraints, like no binary extensions and tight memory budgets. The bigger idea: AI lowers the cost of alternative runtimes and weird platform bets that previously died on compatibility.

Shipping Too Fast Is Turning Into an Industry-Wide Problem

This is the core of the episode. Armin says GitHub activity is measurably up, GitHub availability is down, and after a bunch of private interviews with engineering teams he keeps hearing the same thing: everyone is shipping too fast. He contrasts last year’s joyful solo experimentation with today’s pressure cooker, where entire organizations expect that same pace, token spending becomes a proxy for performance, and marketers, PMs, and long-removed ex-engineers are suddenly submitting code.

Bugs, Security, and Why Frontier Models May Be Held Back

They connect recent incidents — leaked Claude Code internals, Railway’s exposure, an Amazon outage attributed to AI, and private security disclosures Ben has received — into a larger pattern. Armin says present-day models in the right security harness are already remarkably good at finding vulnerabilities, even in code they’ve never seen, including attempts to escape his own agent sandbox. That leads into the darker speculation: maybe frontier labs hold back models partly because stronger public models would flood the world with exploitable findings faster than teams could patch them.

“Slop Theater” and the Weird Psychology of Prompting All Day

Ben coins “slop theater” for the performative side of agent coding: 25-agent canvases, giant markdown skill stacks, and endless screenshots of parallel work. Armin uses Garry Tan’s GStack and other giant codegen-heavy repos as examples of systems optimized for token burn and output volume rather than sane software engineering, then lands a much more human point: watching someone prompt agents all day can feel like watching someone on drugs. His most memorable observation is that developers now fill five sandboxes at once, exhaust five context windows by afternoon, and don’t notice their own judgment collapsing with them.

Good Foundations Make Agents Feel Magical; Messy Products Make Them Dangerous

Armin says agents feel amazing on his well-engineered Rust libraries like MiniJinja and Similar, where years of careful abstraction give them solid footing. On Arendelle’s product codebase, though, feature flags, entitlement logic, and cross-cutting state make them excellent at adding entropy and terrible at removing it. He compares this to an old Halo: Master Chief Collection matchmaking project full of invalid state combinations — the kind of mess humans eventually escalate and redesign, while agents just keep extending it.

The Next Phase: Model Discrimination, Tool Lockdown, and Expensive Computing

Near the end they predict two social shifts. First, people will start discriminating against PRs based on which model or harness produced them — Armin mentions OpenCode maintainers preferring Codex over Opus, and both think “what generated this?” will become a real review filter. Second, companies will likely standardize approved agent stacks for security and compliance, even if engineers rebel and pay for Claude or other tools out of pocket; all of this is happening while serious users are now casually spending around $500/month each on subscriptions and maxed-out Macs just to keep up.