AI Engineer · 19m

Replacing 12K LoC with a 200 LoC Skill — David Gomes, Cursor

TL;DR

  • Cursor replaced a complex worktrees feature with mostly prompts — David Gomes says the old implementation took roughly 12K–15K lines of code, while the new version is essentially a ~200-line markdown-style command/skill setup plus a ~40-line “best of N” prompt.

  • The trick was combining two existing primitives: skills and sub-agents — instead of hardcoding worktree behavior, Cursor now tells the model how to create a worktree, run setup scripts, stay inside it, and optionally spawn multiple model-specific sub-agents for side-by-side comparisons.

  • The new system is more flexible, especially for power users — users can now jump into a worktree mid-chat with /worktree, use it across multi-repo setups, and even ask the parent agent to merge ideas from Opus, GPT, Grok, Kimi, and Composer after a “best of N” run.

  • What got simpler in code got shakier in behavior — the old system made it physically impossible for an agent to touch files outside its assigned worktree, while the new version is “a bit vibes based,” relying on aggressive prompting and model obedience over long sessions.

  • Model quality matters a lot when the guardrails are prompts instead of hard constraints — Gomes says early evals show smaller models like Haiku more often drift into the primary checkout, while Composer and Grok have performed better at staying scoped correctly.

  • Cursor is already planning a partial swing back toward native UX — even as they improve the prompt-based version with evals, Braintrust-assisted testing, and RL for Composer, they also plan a more fully native worktrees implementation inside Cursor 3.0’s new agent-first interface.

The Breakdown

“Markdown Is Basically the New Code”

David Gomes opens with the big claim: Cursor recently replaced a heavyweight internal feature with “just markdown, just a skill.” The talk is about taking something that used to require thousands of lines of code, dependencies, tests, and cleanup logic, and turning it into a much lighter prompt-driven system.

A Quick Refresher on Git Worktrees and Cursor’s Original Feature

He explains worktrees as separate checkouts of the same repo, so multiple agents can work in parallel without stepping on each other. In Cursor, this enabled isolated commands, side-by-side model comparisons, PR creation from individual worktrees, and “best of N,” where different models compete on the same task and you pick the best diff or UI result.
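Under the hood these are plain git worktrees. A minimal sketch of the mechanics (the scratch repo and branch names are illustrative):

```shell
# Demonstrate git worktrees in a scratch repo (paths are illustrative).
set -e
tmp=$(mktemp -d)
git init -q "$tmp/main"
cd "$tmp/main"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
# Each worktree is a separate checkout of the same repo on its own branch,
# so parallel agents can edit files without stepping on each other.
git worktree add -q ../agent-a -b agent-a
git worktree add -q ../agent-b -b agent-b
git worktree list                 # primary checkout plus two linked worktrees
git worktree remove ../agent-b    # cleanup is an explicit extra step
```

The explicit `remove` step hints at why the old implementation needed so much machinery: hundreds of these checkouts can pile up if nothing cleans them.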

The Old Version Was Powerful — and Expensive

When Cursor shipped this around October alongside Cursor 2.0, it came with a lot of machinery: creating and managing worktrees, isolating agents, running setup scripts, judging outputs, tweaking the harness, adding reminders, and cleaning up the hundreds of worktrees users might leave behind. Gomes says he recently opened a PR that deleted the whole feature implementation — around 15,000 lines of code gone.

Rebuilding Worktrees with Skills and Sub-Agents

The new implementation leans on two existing primitives: agent skills and sub-agents. A /worktree command tells the model how to create a worktree, run any user setup scripts, and most importantly stay in that checkout; the “best of N” version is even smaller, instructing a parent agent to spin up sub-agents on different models, each in its own worktree, then compare the results in a nice table.
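The shape of such a command file might look like the following. This is a hypothetical sketch based on the talk's description, not Cursor's actual /worktree skill:

```markdown
<!-- Hypothetical sketch of a /worktree command file, not the real one -->
# /worktree

When the user invokes /worktree:

1. Create a new git worktree next to the primary checkout, e.g.
   `git worktree add ../<branch-name> -b <branch-name>`.
2. Run the user's setup scripts inside the new worktree, if any exist.
3. From now on, perform ALL file reads, edits, and commands inside that
   worktree directory. Never touch the primary checkout.

For "best of N": spawn one sub-agent per requested model, each in its own
worktree, wait for all of them to finish, then compare the resulting diffs
in a table.
```

Note that step 3 is a request, not a constraint — which is exactly the tradeoff Gomes discusses next.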

Why This Version Is Better for Real Users

Gomes is candid that the biggest win is maintenance: this is a power-user feature, not something 90% of Cursor users touch, so it shouldn’t demand tons of engineering time. But users also gained real flexibility: they can switch into a worktree halfway through a chat, use worktrees in multi-repo setups for the first time, and ask the parent agent to combine parts of different model outputs instead of being forced to pick only one.

The Tradeoff: Hard Guarantees Became “Vibes-Based”

The downside is the one Gomes keeps coming back to: staying in the correct worktree is now enforced by prompting, not by the system making it impossible to escape. He jokes that the model is basically told “operate on this directory” and then “knock on wood please don’t forget,” which gets shakier in long sessions, especially with weaker models that hallucinate or go haywire.

Evals, Braintrust, and Training the Model Not to Wander

To improve this, Cursor is building evals and feeding the lessons into prompts and RL training. Gomes says Braintrust made writing these evals surprisingly easy: he spins up the headless Cursor CLI and scores whether the model did work in the worktree and whether it incorrectly touched the primary checkout; early results already show Haiku drifting more often, while Composer and Grok behave better.
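A minimal version of that scoring step can be sketched with plain git, assuming an agent run leaves uncommitted changes behind. The `score_run` helper and the two paths are hypothetical; the actual evals go through Braintrust and the headless Cursor CLI:

```shell
# Hypothetical post-run check for a worktree eval: did the agent do work in
# its assigned worktree, and did it leave the primary checkout untouched?
score_run() {
  local wt="$1" primary="$2"
  # "did work": the worktree has uncommitted or untracked changes after the run
  if [ -n "$(git -C "$wt" status --porcelain)" ]; then
    echo "did_work=1"
  else
    echo "did_work=0"
  fi
  # "stayed scoped": the primary checkout is still clean
  if [ -z "$(git -C "$primary" status --porcelain)" ]; then
    echo "stayed_scoped=1"
  else
    echo "stayed_scoped=0"
  fi
}
```

A run that edits files only inside the worktree would score `did_work=1 stayed_scoped=1`; a model that drifts into the primary checkout, as Haiku reportedly does, would fail the second check.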

What’s Next: A More Native Return in Cursor 3.0

Even while improving the prompt-based approach, Cursor is also “taking a small step back” and building a more complete native worktrees implementation into the new Cursor 3.0 agent window. Gomes says that UI is a better fit for serious parallel local coding, and he hints they’re also exploring non-git parallelization primitives because worktrees can be slow, disk-heavy, and useless outside git repos.