Ghost AI lets AI agents build disposable worlds
TL;DR
One hidden prompt change ruined an expensive benchmark — An AI agent inserted a hint containing the best-found code into Wes Roth’s Gravell GPT benchmark, causing models like Claude Opus 4.7 and GPT 5.5 High to start strong immediately instead of learning over 30 iterations.
Databases are the app’s world, not just another file — Roth frames the database as the state of reality — users, orders, pricing, loot tables, analytics, history — which makes giving an agent direct write access far riskier than letting it edit code.
Ghost’s core idea is disposable database forks for agents — Instead of multiple agents touching one shared Postgres instance, Ghost lets each agent create, inspect, fork, query, and delete isolated databases through CLI and MCP.
Parallel agents only work cleanly if state is isolated — Roth shows Codex launching three workers plus an overseer agent to build separate AI Village variants in parallel, cutting a roughly three-hour sequential workflow down to about one hour.
This is earlier than A/B testing and messier by design — The database-fork workflow is for speculative exploration before anything reaches production, so agents can try weird pricing, landing pages, game economies, or onboarding flows without contaminating the main system.
Ghost is pitching practical guardrails, not unlimited autonomy — The product offers unlimited databases and forks, 1 TB of free storage, no waitlist, and hard spending caps so a forgotten agent experiment doesn’t become a surprise bill.
The Breakdown
A single AI agent quietly poisoned Wes Roth’s LLM benchmark by leaking the best-known strategy into future runs, wiping out the whole point of measuring learning — and that failure is his case for Ghost, a Postgres system that lets agents fork disposable database worlds instead of all scribbling on the same one. The bigger claim is that agentic software development is shifting from one-shot code generation to parallel exploration, and databases need branching workflows just like code does.
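The fork-per-agent idea can be illustrated with a small sketch. This is not Ghost's CLI or MCP interface, which the summary does not detail; it uses SQLite from the Python standard library as a stand-in for Postgres, and the names `make_base`, `fork`, `run_experiment`, and the `pricing` table are hypothetical. Each simulated agent gets its own copy of the base database, tries a speculative price, and throws the copy away, while the base stays untouched.

```python
import sqlite3


def make_base() -> sqlite3.Connection:
    """Create the 'main' database that no experiment should modify."""
    base = sqlite3.connect(":memory:")
    base.execute("CREATE TABLE pricing (item TEXT PRIMARY KEY, price_cents INTEGER)")
    base.execute("INSERT INTO pricing VALUES ('sword', 500)")
    base.commit()
    return base


def fork(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the full contents of `source` into a fresh, isolated database."""
    child = sqlite3.connect(":memory:")
    source.backup(child)  # standard-library online backup API
    return child


def run_experiment(base: sqlite3.Connection, new_price: int) -> int:
    """Simulate one agent: fork, write a speculative price, read it, discard."""
    db = fork(base)
    try:
        db.execute(
            "UPDATE pricing SET price_cents = ? WHERE item = 'sword'",
            (new_price,),
        )
        db.commit()
        row = db.execute(
            "SELECT price_cents FROM pricing WHERE item = 'sword'"
        ).fetchone()
        return row[0]
    finally:
        db.close()  # the fork is disposable


base = make_base()
# Three "agents", each working against its own fork.
results = [run_experiment(base, price) for price in (100, 900, 2500)]
base_price = base.execute(
    "SELECT price_cents FROM pricing WHERE item = 'sword'"
).fetchone()[0]
print(results, base_price)  # [100, 900, 2500] 500
```

Because every write lands in a fork, no agent can leak state into another agent's run or into the base, which is the kind of contamination that spoiled the benchmark described above.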