AI Engineer · 1h 15m

Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

TL;DR

  • Anthropic says long-running agents are now a harness problem as much as a model problem — Andrew Wilson traced Claude from struggling with bash commands and 20-minute runs to Opus 4.6 hitting roughly 12 hours on a minimal scaffold, with Claude Code itself now “effectively” writing most of Claude Code.

  • Self-evaluation is the trap; separate builder and critic roles work much better — Ash Prabaker’s core pattern is a GAN-like generator/evaluator loop where the evaluator uses Playwright to actually click through the app, score it, and send back critique instead of asking one agent to grade its own homework.

  • The big unlock is adversarial contracts, not just better prompts — before coding starts, the generator and evaluator negotiate what “done” means via files on disk, turning a vague spec into testable criteria so the evaluator grades against an agreed contract rather than a fluffy plan.md.

  • Their retro game maker demo showed how scaffolding changes outcomes more than the base prompt — with the same prompt and model, a solo loop produced a decent-looking but non-playable app, while the planner-generator-evaluator harness spent about 6 hours and $200 to produce “Retro Forge,” complete with a working play mode, collision logic, debug HUD, and even an AI level assistant.

  • Anthropic is simplifying harnesses as models improve, not abandoning them — patterns that mattered for Opus 4.5, like forced context resets and tighter sprint decomposition, became less necessary with Opus 4.6, which handled long continuous sessions plus compaction more coherently.

  • The unglamorous secret is still reading traces by hand — Prabaker said the real work was tuning the evaluator’s judgment against human expectations by combing through transcripts line by line, because out of the box Claude was “a really, really bad” QA agent prone to generosity bias.

The Breakdown

From 20-Minute Agents to Multi-Hour Runs

Andrew Wilson opens with a quick history lesson: a year ago, Claude Code could barely write bash commands and would flame out after 20 minutes; now Boris, Claude Code’s creator, says most of Claude Code is being written by Claude Code, and it can run for days. Long runs are hard, he says, for three reasons: context limits, weak planning, and a model’s tendency to overrate its own half-finished work.

The Model Improved — but So Did the Scaffold

Wilson’s main point is that Anthropic’s gains came from co-evolving models and harnesses together. He walks through the stack: computer use, MCP, Claude Code, the Agent SDK, checkpoints, skills, programmatic tool calling, sub-agents, server-side compaction, and 1 million-token context — all the little scaffolding choices that let agents keep working without losing the plot.

The Ralph Loop Era and the First Long-Running Harnesses

He revisits the Ralph Wiggum technique — “deterministically bad in an undeterministic world” — as a surprisingly simple but influential pattern: break work into phases, keep looping until done, and prefer predictable failure over flaky success. Anthropic’s early long-running harnesses added more structure: an initializer agent wrote persistent files like featurelist.json and progress logs, then each fresh session picked one feature, ran tests with Puppeteer, committed working code, and moved on.
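
The one-feature-per-session loop described above can be sketched roughly as follows. This is a minimal illustration, not Anthropic's harness: the `passes` field, the log format, and the `run_session` callback (which stands in for a fresh agent session that builds the feature, tests it, and commits) are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def run_feature_loop(
    feature_file: Path,
    progress_log: Path,
    run_session: Callable[[str], bool],
    max_sessions: int = 50,
) -> int:
    """Work through the feature list one feature per session.

    All state lives on disk, so every session starts fresh and only
    needs the feature file and the progress log to know where it is.
    Returns the number of features that passed.
    """
    completed = 0
    for _ in range(max_sessions):
        # Re-read state from disk at the start of every session.
        features = json.loads(feature_file.read_text())
        pending = [f for f in features if not f["passes"]]
        if not pending:
            break  # nothing left to build

        feature = pending[0]
        # Hypothetical hook: a real harness would launch an agent here.
        ok = run_session(feature["name"])
        if ok:
            feature["passes"] = True
            completed += 1

        # Persist the updated list and append one line to the progress log.
        feature_file.write_text(json.dumps(features, indent=2))
        with progress_log.open("a") as log:
            log.write(f"{feature['name']}: {'passed' if ok else 'failed'}\n")
    return completed
```

A failed feature stays pending, so the next session simply retries it. That is the "keep looping until done" part of the pattern, with `max_sessions` as the only safety stop.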

Ash’s Big Thesis: Don’t Let the Builder Grade Its Own Homework

Ash Prabaker takes over and pivots from history to what Anthropic is using now. His core idea is a generator/evaluator setup borrowed from GANs: one model builds, another model criticizes, and the critic uses tools like Playwright to inspect the live app instead of politely nodding at diffs and saying “looks done.”
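
In code, the builder/critic split might look something like the sketch below. The `generate` and `evaluate` callbacks are hypothetical placeholders for two separate agents (the second one would be the Playwright-driven evaluator), and the 0.8 pass threshold and five-round cap are illustrative values, not figures from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    score: float   # 0.0 to 1.0, assigned by the evaluator, never the builder
    critique: str  # feedback handed back to the generator


def build_with_critic(
    spec: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Review],
    pass_score: float = 0.8,
    max_rounds: int = 5,
) -> Tuple[str, List[Review]]:
    """Alternate between a generator and an independent evaluator.

    The generator only ever sees the spec plus the latest critique;
    whether the work is done is decided by the evaluator's score.
    """
    critique = ""
    artifact = ""
    history: List[Review] = []
    for _ in range(max_rounds):
        artifact = generate(spec, critique)
        review = evaluate(artifact)
        history.append(review)
        if review.score >= pass_score:
            break  # the critic, not the builder, signs off
        critique = review.critique
    return artifact, history
```

Keeping the stop condition on the evaluator's side is the point of the design: the generator has no way to declare its own work finished.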

Taste Can Be Graded If You Actually Write the Rubric Down

For front-end quality, Prabaker says Anthropic explicitly scores outputs on design, originality, craft, and functionality — with heavier weight on design and originality to avoid the usual “purple gradients” and generic AI-slop look. The evaluator is calibrated with few-shot examples so its taste converges toward Anthropic’s, and the loop can even throw away an entire design and restart if it keeps scoring badly on one dimension.
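
A weighted rubric with a restart rule could be written down along these lines. The four dimensions come from the talk, but the specific weights, the 0 to 10 scale, the floor of 4.0, and the three-round patience are assumptions chosen only to make the example concrete.

```python
from typing import Dict, List

# Assumed weights: design and originality count more than craft and
# functionality, as described in the talk; the exact numbers are illustrative.
WEIGHTS: Dict[str, float] = {
    "design": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine per-dimension scores (assumed 0 to 10) into one number."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def should_restart(
    history: List[Dict[str, float]],
    floor: float = 4.0,
    patience: int = 3,
) -> bool:
    """Return True if any single dimension stayed below `floor`
    for the last `patience` rounds in a row."""
    if len(history) < patience:
        return False
    recent = history[-patience:]
    return any(all(r[dim] < floor for r in recent) for dim in WEIGHTS)
```

With these assumed weights, a design that scores well on craft and functionality but keeps missing on originality still gets flagged for a restart, rather than coasting on a decent average.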

The Retro Forge Demo: Same Prompt, Totally Different Result

The contrast is sharp. A plain one-agent run on “build a retro game maker” produced something that looked plausible — splash screen, sprite editor, basic UI — but completely fell apart when you actually tried to play the game. With the harness, after about 6 hours and around $200, the same model produced “Retro Forge,” a much fuller app with a project dialog, a richer sprite editor, an AI level assistant, and a playable game mode with moving characters, collision, and a debug HUD clearly added to help the evaluator test it.

What Actually Made the QA Agent Useful

Prabaker is candid that Claude was initially “a really, really bad” QA agent; it would spot bugs and then shrug them off with something like “fix later, might take 2 weeks.” The fix wasn’t magic — it was relentless prompt tuning, detailed contracts, and reading traces by hand until the team could see exactly where the model’s judgment diverged from human judgment.

The Frontier Moved, So the Harness Got Simpler

The final message is not “build this exact harness,” but “adapt to the current model.” Anthropic dropped some older tricks like forced context resets and stricter sprint slicing once Opus 4.6 got better at long continuous sessions, but kept the planner-generator-evaluator core, filesystem state, and harsh external evaluation. In the Q&A, both speakers keep coming back to the same practical advice: use sub-agents, Playwright MCP or Claude for Chrome, store shared state on disk, and if you want to know why your agent failed, read the whole trace.
