Wes Roth · 25m

Ghost AI lets AI Agents build disposable worlds

TL;DR

  • One hidden prompt change ruined an expensive benchmark — An AI agent inserted a hint containing the best-found code into Wes Roth’s Gravell GPT benchmark, causing models like Claude Opus 4.7 and GPT 5.5 High to start strong immediately instead of learning over 30 iterations.

  • Databases are the app’s world, not just another file — Roth frames the database as the state of reality — users, orders, pricing, loot tables, analytics, history — which makes giving an agent direct write access far riskier than letting it edit code.

  • Ghost’s core idea is disposable database forks for agents — Instead of multiple agents touching one shared Postgres instance, Ghost lets each agent create, inspect, fork, query, and delete isolated databases through CLI and MCP.

  • Parallel agents only work cleanly if state is isolated — Roth shows Codex launching three workers plus an overseer agent to build separate AI Village variants in parallel, cutting a roughly three-hour sequential workflow down to about one hour.

  • This is earlier than A/B testing and messier by design — The database-fork workflow is for speculative exploration before anything reaches production, so agents can try weird pricing, landing pages, game economies, or onboarding flows without contaminating the main system.

  • Ghost is pitching practical guardrails, not unlimited autonomy — The product offers unlimited databases and forks, 1 TB of free storage, no waitlist, and hard spending caps so a forgotten agent experiment doesn’t become a surprise bill.
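
The fork-per-agent idea in the points above can be illustrated with a small sketch. This is not Ghost's CLI or MCP interface, which the episode summary does not spell out. It swaps in SQLite in-memory databases and Python's standard `sqlite3` backup call as a stand-in for Postgres forks. The `pricing` table, the agent names, and the helpers `make_base`, `fork`, and `run_experiment` are all hypothetical, chosen only to show that each agent can mutate its own copy while the shared baseline stays untouched.

```python
import sqlite3

# Stand-in sketch: SQLite in-memory copies play the role of database forks.
# Table, agent names, and helper functions are illustrative assumptions,
# not part of Ghost's actual interface.


def make_base() -> sqlite3.Connection:
    """Create the shared 'main' database that no agent should write to."""
    base = sqlite3.connect(":memory:")
    base.execute("CREATE TABLE pricing (plan TEXT PRIMARY KEY, monthly_usd REAL)")
    base.executemany(
        "INSERT INTO pricing VALUES (?, ?)",
        [("basic", 10.0), ("pro", 30.0)],
    )
    base.commit()
    return base


def fork(source: sqlite3.Connection) -> sqlite3.Connection:
    """Return an isolated copy of `source` that can be changed and thrown away."""
    clone = sqlite3.connect(":memory:")
    source.backup(clone)  # copies every page of source into clone
    return clone


def run_experiment(db: sqlite3.Connection, multiplier: float) -> dict:
    """A speculative change an agent might try: rescale every price."""
    db.execute("UPDATE pricing SET monthly_usd = monthly_usd * ?", (multiplier,))
    db.commit()
    return dict(db.execute("SELECT plan, monthly_usd FROM pricing"))


base = make_base()
results = {}
for agent, multiplier in [("agent_a", 0.5), ("agent_b", 2.0), ("agent_c", 1.5)]:
    db = fork(base)
    try:
        results[agent] = run_experiment(db, multiplier)
    finally:
        db.close()  # the fork is disposable: closing it discards the experiment

# The baseline is read only after every agent has finished.
baseline = dict(base.execute("SELECT plan, monthly_usd FROM pricing"))
```

Each agent's result reflects only its own multiplier, and `baseline` still holds the original prices afterward. In Postgres, a comparable manual approach would be copying a database with `CREATE DATABASE ... TEMPLATE`, though how Ghost implements its forks is not stated in the summary.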

The Breakdown

A single AI agent quietly poisoned Wes Roth’s LLM benchmark by leaking the best-known strategy into future runs, wiping out the whole point of measuring learning — and that failure is his case for Ghost, a Postgres system that lets agents fork disposable database worlds instead of all scribbling on the same one. The bigger claim is that agentic software development is shifting from one-shot code generation to parallel exploration, and databases need branching workflows just like code does.
