AI Engineer · 19m

LLM codegen fails and how to stop 'em — Danilo Campos, PostHog

TL;DR

  • PostHog’s wizard works at real scale — Danilo Campos says 15,000 people a month use the PostHog Wizard to generate integrations, and he opens with his proof point: two unsolicited happy posts on Bluesky and Twitter from the previous six hours.

  • The first failure mode is model rot, not model stupidity — because models are trained on a stale snapshot of the web, PostHog fixes bad invented APIs and fake keys by letting the agent pull fresh markdown docs from posthog.com directly into context.

  • PostHog fights bad codegen patterns with “model airplanes” — instead of full production apps, the team maintains slim reference projects across frameworks and languages so the agent sees the right integration shape, like where login and identity tracking should go, without wasting tokens.

  • They prevent chaos by breadcrumbing the agent instead of over-specifying upfront — the wizard first asks the model to find business-critical files like login, Stripe, or churn-related flows, then names useful events, and only later starts implementation, which keeps 15,000 runs from turning into 15,000 different setups.

  • The biggest source of agent failure was the humans building it — Danilo describes contradictory tool instructions, missing MCP tools, and even JavaScript guidance being fed into Python projects, all discovered by asking the agent after each run: “What could we have done better to set you up for success?”

  • His core thesis is that prose now compounds better than code — the successful wizard is “90% markdown files, 8% tools,” because good plain-text guidance improves as models improve, while code scaffolding depreciates and can overconstrain what he compares to an octopus-like agent.

The Breakdown

Robots Already Bloodied His Nose

Danilo opens with a joke that sets the tone: he’s not afraid of robots because they’ve “already bloodied my nose so many times.” From there he frames the PostHog Wizard as a machine that skips “two hours of misery” and turns it into “8 minutes of pseudo entertainment,” with 15,000 monthly users getting integrations they actually like.

Model Rot Is Why Agents Make Stuff Up

His first concrete failure mode is “model rot”: LLMs are trained on an expensive, slow-moving snapshot of the world, so fast-moving software projects look alien to them. That’s why early agents asked to integrate PostHog would invent keys, patterns, and APIs that didn’t exist, so the team solved it by feeding in fresh markdown docs and letting the agent choose the right up-to-date context.
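
The talk doesn’t spell out the exact shape of that docs-fetching mechanism, but the idea reduces to something like this minimal TypeScript sketch: give the agent a way to pull a current documentation page into context on demand, so it reasons from today’s API surface instead of its training snapshot. The endpoint, the Accept header, and the tool schema here are assumptions for illustration, not PostHog’s real wizard code.

```typescript
// Hypothetical tool: fetch a current docs page so the agent works from fresh
// markdown instead of a stale training snapshot. Endpoint and header are assumed.
type ToolResult = { content: string };

async function fetchFreshDocs(slug: string): Promise<ToolResult> {
  // e.g. slug = "libraries/next-js" — the agent decides which page it needs
  const res = await fetch(`https://posthog.com/docs/${slug}`, {
    headers: { Accept: "text/markdown" },
  });
  if (!res.ok) {
    throw new Error(`Docs fetch failed: ${res.status} for ${slug}`);
  }
  return { content: await res.text() };
}
```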

“Model Airplanes” Teach the Right Shape of an Integration

Danilo says models have clearly scraped plenty of projects with questionable architecture, which explains some bizarre codegen decisions. Their answer is a fleet of “model airplanes” — thin, semi-fake apps across frameworks and languages where the auth is “auth-shaped” rather than fully real, so the agent can learn exactly where tracking belongs without dragging in a giant production codebase.
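
A minimal sketch of what one “model airplane” file might look like, assuming posthog-js for the analytics calls. The stubbed `verifyCredentials` and the file layout are illustrative, not PostHog’s actual reference apps; the point is that the identify and capture calls sit exactly where the agent should learn to put them.

```typescript
// One "model airplane" slice: a slim, auth-shaped login handler whose job is to
// show the agent where identity tracking belongs. posthog.init() is assumed to
// have happened elsewhere in the reference project.
import posthog from "posthog-js";

export async function handleLogin(email: string, password: string) {
  // Auth-shaped, not real auth: a stub stands in for a production identity provider.
  const user = await verifyCredentials(email, password);

  // The part the agent is meant to imitate: tie analytics identity to the login path.
  posthog.identify(user.id, { email: user.email });
  posthog.capture("user_logged_in");

  return user;
}

// Stub so the reference project type-checks without a real backend.
async function verifyCredentials(email: string, _password: string) {
  return { id: `user_${email}`, email };
}
```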

Breadcrumbing Beats Letting the Agent Freestyle

At PostHog’s scale, 15,000 integrations a month could become 15,000 weirdly different implementations and a support nightmare — his “sorcerer’s apprentice” scenario. So instead of telling the model everything upfront, they lead it step by step: first identify business-value files like login or Stripe, then brainstorm meaningful events, save those, and only then begin the actual integration work.
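
A rough sketch of that staged flow, with a hypothetical `runAgent` callback standing in for whatever the wizard actually uses to drive the model, and a local markdown file standing in for its saved plan:

```typescript
// Breadcrumbing sketch: three small prompts in sequence instead of one giant
// upfront spec. RunAgent and the persistence file are assumed stand-ins.
import { writeFile } from "node:fs/promises";

type RunAgent = (prompt: string) => Promise<string>; // assumed agent-call signature

export async function breadcrumbedSetup(runAgent: RunAgent) {
  // Step 1: locate the business-critical surfaces before any code is written.
  const files = await runAgent(
    "List the files in this repo that handle login, billing (e.g. Stripe), or churn-related flows."
  );

  // Step 2: name the events worth capturing, and persist them so later steps stay anchored.
  const events = await runAgent(
    `Given these files:\n${files}\nPropose a short list of business-meaningful analytics events.`
  );
  await writeFile("posthog-events.md", events);

  // Step 3: only now start implementing, constrained by the saved plan.
  return runAgent(
    "Implement PostHog capture calls for the events in posthog-events.md, touching only the files identified earlier."
  );
}
```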

The Agent Wasn’t the Main Problem — the Team Was

One of the best parts of the talk is his point that the biggest threat to agent outcomes is “ourselves,” not the model. He gives examples: contradictory MCP instructions, a supposedly required tool that didn’t exist for hundreds of runs, and JavaScript instructions accidentally handed to a Python project — all caught because they added a cheap stop-hook question asking the agent what would have helped it succeed.
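
The quoted question is from the talk; everything else in this sketch (the `askAgent` callback, the log file) is an assumed stand-in for how a cheap post-run feedback hook could be wired up.

```typescript
// Post-run feedback hook sketch: after each wizard run, ask the agent what would
// have helped it, and keep the answers for the team to read in aggregate.
import { appendFile } from "node:fs/promises";

type AskAgent = (prompt: string) => Promise<string>; // assumed agent-call signature

export async function collectRunFeedback(runId: string, askAgent: AskAgent) {
  const feedback = await askAgent(
    "What could we have done better to set you up for success?"
  );
  // Contradictory instructions, missing tools, and wrong-language guidance all
  // surfaced from reading answers like this across many runs.
  await appendFile("agent-feedback.log", `${runId}\t${feedback}\n`);
}
```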

Trust, Secrets, and Avoiding “Shenanigans” on User Machines

Because the wizard runs on someone else’s computer, trust is everything. Danilo admits early versions read .env files, which was mechanically useful but obviously bad if secrets were getting sent to the cloud, so they locked tool permissions down and replaced broad file access with a tiny purpose-built tool that could only check whether a key existed or write a new value.
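
A hedged sketch of what such a narrowly scoped tool could look like: the agent can ask whether a key exists or write a new one, but can never read existing secret values. The function names and the .env path are illustrative assumptions, not the wizard’s actual implementation.

```typescript
// Narrow replacement for broad .env access: presence checks and append-only writes,
// never reads of existing secret values.
import { readFile, appendFile } from "node:fs/promises";

const ENV_PATH = ".env";

export async function envKeyExists(key: string): Promise<boolean> {
  const contents = await readFile(ENV_PATH, "utf8").catch(() => "");
  // Only report presence; the value itself never leaves this function.
  return contents.split("\n").some((line) => line.startsWith(`${key}=`));
}

export async function writeEnvKey(key: string, value: string): Promise<void> {
  if (await envKeyExists(key)) return; // never overwrite or echo an existing secret
  await appendFile(ENV_PATH, `${key}=${value}\n`);
}
```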

The Real Asset Is Markdown, Not More Clever Code

His closing argument is the big philosophical turn: engineers are used to being rewarded for writing more code, but in this world code is a depreciating asset. The wizard that delights users is “90% markdown files, 8% tools,” because strong prose survives model upgrades and gets more valuable over time, while over-scaffolding an agent just constrains what he memorably describes as an octopus that needs room to wriggle.

How the Wizard Is Actually Wired Up

In Q&A, he explains that context is delivered through skill files generated by a context service that flattens those model airplanes into markdown references the model can grep through. Under the hood it uses the Claude Agent SDK wrapped in a CLI, and PostHog even covers inference via its LLM gateway — though he notes that serving this kind of workflow is still messy, with things like Claude Code storing auth info in unexpected places and breaking for users.
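
He doesn’t go into the context service’s internals, but the flattening step it performs is roughly this kind of job: walk a model-airplane project and emit a single markdown reference the agent can grep through. The directory layout, extension filter, and output path here are assumptions (and the recursive readdir needs Node 20+); the real service is PostHog’s own.

```typescript
// Flattening sketch: concatenate a reference project's source files into one
// markdown skill file, each file under its own heading so the agent can grep it.
import { readdir, readFile, writeFile } from "node:fs/promises";
import { join, extname } from "node:path";

export async function flattenAirplane(projectDir: string, outFile: string) {
  const sections: string[] = [`# Reference integration: ${projectDir}`];
  const entries = await readdir(projectDir, { recursive: true }); // Node 20+
  for (const rel of entries) {
    if (![".ts", ".tsx", ".py"].includes(extname(rel))) continue; // skip non-source files
    const code = await readFile(join(projectDir, rel), "utf8");
    sections.push(`## ${rel}\n\n${code}`);
  }
  await writeFile(outFile, sections.join("\n\n"));
}
```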