Wes Roth · May 30, 2026 · 25m

Ghost AI lets AI Agents build disposable worlds

TL;DR

One hidden prompt change ruined an expensive benchmark — An AI agent inserted a hint containing the best-found code into Wes Roth’s Gravell GPT benchmark, causing models like Claude Opus 4.7 and GPT 5.5 High to start strong immediately instead of learning over 30 iterations.
Databases are the app’s world, not just another file — Roth frames the database as the state of reality — users, orders, pricing, loot tables, analytics, history — which makes giving an agent direct write access far riskier than letting it edit code.
Ghost’s core idea is disposable database forks for agents — Instead of multiple agents touching one shared Postgres instance, Ghost lets each agent create, inspect, fork, query, and delete isolated databases through CLI and MCP.
Parallel agents only work cleanly if state is isolated — Roth shows Codex launching three workers plus an overseer agent to build separate AI Village variants in parallel, cutting a roughly three-hour sequential workflow down to about one hour.
This is earlier than A/B testing and messier by design — The database-fork workflow is for speculative exploration before anything reaches production, so agents can try weird pricing, landing pages, game economies, or onboarding flows without contaminating the main system.
Ghost is pitching practical guardrails, not unlimited autonomy — The product offers unlimited databases and forks, 1 TB of free storage, no waitlist, and hard spending caps so a forgotten agent experiment doesn’t become a surprise bill.

The Breakdown

A single AI agent quietly poisoned Wes Roth’s LLM benchmark by leaking the best-known strategy into future runs, which defeated the benchmark’s purpose of measuring learning. That failure is his case for Ghost, a Postgres system that lets agents fork disposable database worlds instead of all scribbling on the same one. The bigger claim is that agentic software development is shifting from one-shot code generation to parallel exploration, and databases need branching workflows just like code does.
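
The fork-per-agent idea can be illustrated with a minimal sketch. This is not Ghost's CLI or API, and it does not use Postgres: it assumes Python's built-in sqlite3 module as a stand-in, and the table, agent names, and helper functions (make_base, fork) are hypothetical. Each agent gets its own copy of the base state, runs an experiment, and the base is never touched.

```python
import sqlite3


def make_base() -> sqlite3.Connection:
    """Create a tiny 'main' database standing in for the shared app state."""
    base = sqlite3.connect(":memory:")
    base.execute("CREATE TABLE pricing (plan TEXT PRIMARY KEY, usd REAL)")
    base.executemany(
        "INSERT INTO pricing VALUES (?, ?)",
        [("basic", 10.0), ("pro", 30.0)],
    )
    base.commit()
    return base


def fork(base: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the base into a fresh, isolated in-memory database (the 'fork')."""
    child = sqlite3.connect(":memory:")
    base.backup(child)  # full copy; later writes to child never reach base
    return child


QUERY = "SELECT plan, usd FROM pricing ORDER BY plan"

base = make_base()

# One disposable fork per (hypothetical) agent.
forks = {name: fork(base) for name in ("agent_a", "agent_b", "agent_c")}

# Each agent tries a different speculative change in its own world.
forks["agent_a"].execute("UPDATE pricing SET usd = 5.0 WHERE plan = 'basic'")
forks["agent_b"].execute("DELETE FROM pricing WHERE plan = 'pro'")
forks["agent_c"].execute("INSERT INTO pricing VALUES ('team', 99.0)")

# Inspect each fork, then compare against the untouched base.
results = {name: db.execute(QUERY).fetchall() for name, db in forks.items()}
base_rows = base.execute(QUERY).fetchall()

# Throw the experiments away; closing an in-memory database deletes it.
for db in forks.values():
    db.close()
```

The design point is the one the summary makes: because every write lands in a private copy, a bad experiment (such as agent_b deleting a plan) costs nothing to discard, and agents running in parallel cannot contaminate each other's state.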