AI Engineer · 18m

Building pi in a World of Slop — Mario Zechner

TL;DR

  • Mario Zechner built Pi because Claude Code stopped being predictable — he says the real breaking point wasn’t UI bugs like the “third iteration of a TUI renderer,” but that Claude Code kept changing prompts, tools, and context handling behind his back with “zero observability.”

  • Minimal harnesses can outperform fancy agent stacks — he points to Terminal Bench, where a bare-bones tmux-keystroke harness ranked near the top in December 2025, often beating native model harnesses despite having no file tools or subagents.

  • Pi’s core bet is that coding agents should be self-modifying and user-shaped — the system prompt is tiny, the toolset is just four tools, and extensions are plain TypeScript modules that the agent can write and hot-reload during the same session.

  • Open source maintainers are getting buried by ‘clankers’ — after Pi was used inside OpenClaw, Zechner says his issue tracker filled with low-quality AI-generated issues and PRs, so he started auto-closing them unless the submitter rewrote the issue in a short, human voice.

  • The bigger warning is not tool design but workflow collapse — Zechner argues teams are using agents to compound “booboos” (errors) with no real bottleneck, creating enterprise-grade complexity in weeks, while review agents, giant memory systems, and million-token contexts won’t save them.

  • His practical advice is to slow down and narrow agent scope — let agents handle boring or bounded tasks like repro cases, non-critical work, and hill-climbing loops, but for anything important, “read every line” and keep the amount of generated code within human review capacity.

The Breakdown

Why Mario Fell Out of Love With Claude Code

Zechner opens by saying this is not a roast of the Claude Code team — he clearly admires them — but a story about why he stopped using it after starting in April 2025. His complaint is simple and sharp: the tool stopped fitting his workflow as it accumulated features, broke unpredictably, and hid crucial context changes from the user. The line that sticks is his construction-site analogy: if his hammer breaks every day, he’s mad; if his dev tools break every day, same story.

The Real Problem: Your Context Isn’t Actually Yours

For him, the killer issue wasn’t flicker or TUI bugs — it was that Claude Code silently manipulated context through changing system prompts, tool definitions, and injected reminders that “may or may not be relevant.” He says that kind of meddling confused the model and broke his workflows, while users had almost no observability, model choice, or real extensibility. Even hooks felt flimsy to him, since each one spawned a new process and never offered deep enough control.

Looking Under the Hood and Finding a Different Thesis

He checked alternatives like Amp and Factory, calling them the “Porsche and Lamborghini” of coding harnesses, then drifted toward OpenCode because of his OSS roots. There too, he found things he disliked: pruning tool output under certain token conditions, injecting LSP errors into edit results too early, and even a default server setup that let any website hit the OpenCode server via permissive CORS. Then he found Terminal Bench and loved the irony: one of the best coding-agent harnesses was basically just a model sending keystrokes to tmux and reading the output.
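The Terminal Bench harness he admired really is that simple in spirit: send keystrokes into a tmux pane, read the screen back, repeat. The sketch below shows that shape, assuming the standard `tmux send-keys` and `tmux capture-pane` commands; the `step` function and the injectable `run` parameter are illustrative, not the benchmark's actual code.

```typescript
import { execFileSync } from "node:child_process";

// Argv for typing keystrokes into a tmux pane (standard tmux subcommand).
function sendKeysArgs(session: string, keys: string): string[] {
  return ["send-keys", "-t", session, keys, "Enter"];
}

// Argv for reading the visible pane contents back as text.
function capturePaneArgs(session: string): string[] {
  return ["capture-pane", "-t", session, "-p"];
}

// One harness step: type what the model asked for, then return the screen.
// `run` is injectable so the loop can be exercised without a live tmux.
function step(
  session: string,
  keys: string,
  run: (args: string[]) => string = (a) =>
    execFileSync("tmux", a, { encoding: "utf8" })
): string {
  run(sendKeysArgs(session, keys));
  return run(capturePaneArgs(session));
}
```

No file tools, no subagents — the model's entire world is whatever the terminal shows, which is exactly the irony Zechner enjoyed.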

Pi: A Tiny Agent Core That Can Rewrite Itself

That benchmark led to his thesis that coding agents are still in the “mess around and find out” phase, so what developers need is a malleable harness they can shape themselves. Pi is his answer: four packages, a minimal core loop, a tiny system prompt, and just four tools, plus docs and extension examples shipped directly with the agent so it can learn to modify itself. His point is almost mischievous — models already know what coding agents are from post-training, so you don’t need 10,000 tokens of ceremony to explain the job.
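A four-tool agent core is small enough to sketch in a few lines. The talk doesn't name Pi's actual tools, so the registry below uses hypothetical names (`read`, `write`, `edit`, `bash`) and stub bodies purely to show the loop shape: the model proposes a step, the harness dispatches it, and the tool's output goes back into the transcript.

```typescript
// Hypothetical tool registry — the four names are illustrative, not Pi's
// actual tool set. Each tool maps a string input to a string observation.
type Tool = (input: string) => string;

const tools: Record<string, Tool> = {
  read:  (path) => `<contents of ${path}>`, // stub
  write: (spec) => `wrote ${spec}`,         // stub
  edit:  (spec) => `edited ${spec}`,        // stub
  bash:  (cmd)  => `$ ${cmd}`,              // stub
};

// The model either calls a tool or declares it is done.
type ModelStep = { tool?: string; input?: string; done?: string };

// Minimal core loop: dispatch tool calls until the model finishes or a
// hard step budget runs out.
function runAgent(model: (transcript: string[]) => ModelStep): string {
  const transcript: string[] = [];
  for (let i = 0; i < 32; i++) {
    const next = model(transcript);
    if (next.done !== undefined) return next.done;
    const tool = tools[next.tool ?? ""];
    transcript.push(tool ? tool(next.input ?? "") : `unknown tool: ${next.tool}`);
  }
  return "step budget exhausted";
}
```

The point of keeping the loop this bare is the one Zechner makes: the model already knows the job from post-training, so the harness only has to stay out of the way.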

Extensions, Hot Reloading, and Letting the Agent Do the Customization

Pi extensions are just TypeScript modules, and the whole setup hot-reloads so you can build in-session and see changes instantly — very game-dev energy, which Zechner explicitly says shaped the design. He shows examples ranging from a joke “talk to the agent while it’s on its main quest” slash command that someone built in five minutes, to Nico’s bizarre multi-agent chat room, to full custom UIs that can even run NES games or Doom. His favorite twist: if you want an extension, you usually don’t build it yourself — you ask Pi to build it for you and then iterate.
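Since extensions are plain TypeScript modules, a toy one fits in a dozen lines. The `ExtensionContext` interface and `registerCommand` call below are invented for illustration — Pi's real extension API may look quite different — but they capture the spirit of the five-minute slash-command example.

```typescript
// Invented extension API for illustration; Pi's actual interface may differ.
interface ExtensionContext {
  registerCommand(name: string, fn: (args: string) => string): void;
}

// A toy slash command in the spirit of the "talk to the agent while it's on
// its main quest" example: pass the user a side-channel message. In a real
// extension this function would be the module's export that Pi hot-reloads.
function activate(ctx: ExtensionContext): void {
  ctx.registerCommand("whisper", (args) => `(aside) ${args}`);
}
```

Hot reloading means the feedback loop is edit, save, and the command is live in the same session — which is also why asking Pi to write the extension for you works: it can test its own output immediately.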

Act Two: ‘Clankers’ Are Wrecking Open Source

Things got ugly when Peter put Pi inside OpenClaw’s agent core, which suddenly made Zechner’s repo the target of lots of OpenClaw-generated traffic from users who didn’t even realize it. He calls these low-effort AI-generated contributors “clankers” and says they’re swamping OSS trackers with garbage issues and PRs, so he started fighting back with filters: auto-close the PR, ask for a short human-written issue, and only whitelist accounts that respond like actual people. He also deprioritizes issues tied to OpenClaw interactions, clusters issue embeddings in 3D space, and sometimes just shuts the tracker entirely — “OSS vacation” — to get his life back.
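The auto-close side of that workflow reduces to a triage heuristic: guess whether an issue body is machine-generated boilerplate, and if so, close it pending a short human rewrite. The markers and thresholds below are invented for illustration — they are not Zechner's actual filters — but they show the shape of the check.

```typescript
// Toy triage heuristic — NOT Zechner's actual filter. Flags issue bodies
// that look auto-generated so a bot can close them and ask for a short,
// human-written version.
const botMarkers: RegExp[] = [
  /as an ai\b/i,                  // model boilerplate leaking through
  /## (expected|actual) behav/i,  // template headings pasted verbatim
  /certainly!/i,                  // chat-assistant filler
];

function looksAutoGenerated(body: string): boolean {
  if (body.length > 4000) return true; // walls of text are a strong signal
  return botMarkers.some((re) => re.test(body));
}

function triage(body: string): "auto-close" | "keep" {
  return looksAutoGenerated(body) ? "auto-close" : "keep";
}
```

The whitelist step then works the other way: any account whose follow-up reads like an actual person gets through, which is cheap to verify and expensive for a clanker to fake.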

The Bigger Warning: Agents Are Compounding Errors Faster Than Humans Can Feel Pain

The final act is the real sermon: stop bragging that your product was “100% built by agents,” because now it sucks and everybody knows why. Zechner argues agents compound mistakes with serial speed, no bottlenecks, and delayed pain; they learned complexity from the internet’s giant pile of mediocre legacy code, and when specs leave blanks, they fill them with more of that sludge. Review agents and giant memory systems don’t solve this, he says — once the codebase sprawls beyond any model’s usable context, the agent patches locally and breaks things globally.

His Prescription: Narrow the Scope, Keep the Judgment Human

He’s not anti-agent; he’s anti-unbounded agent use. Good tasks are scoped, modular, evaluable, and non-mission-critical — things like repro cases, boring automation, hill-climbing, and bounded research where a human can review the output and keep what’s useful. The closing message is pure Zechner: fewer features, more discipline, polish what matters, and if the code is important, write it by hand or at least read every damn line.