How does Claude Code *actually* work?
TL;DR
A harness is the tool layer around the model, not the model itself — Theo defines it as the set of tools plus the environment that lets an LLM do real work like reading files, running bash, and editing code, even though the model itself only generates text.
Harness quality can swing benchmark results dramatically — citing Matt Mayer’s independent benchmark, Theo says Opus jumped from 77% in Claude Code to 93% in Cursor with the harness being the key difference.
Every tool call is a stop-and-restart loop — the model emits a structured tool request, the harness executes it, appends the output to the chat history, and then makes a fresh API call so the model can continue with updated context.
Context beats brute-force code dumping — Theo argues that tools plus search made projects like RepoMix mostly obsolete, because stuffing entire codebases into context creates a 'needle in a haystack' problem and models get worse past roughly the 50k–100k token range.
You can build a basic coding harness in about 60–200 lines — using articles from the AMP team and Mah, he walks through a tiny Python agent with just three core tools: read files, list files, and edit files, then shows it can be reduced even further to basically a bash tool.
Cursor feels smarter because its prompt and tool design are tuned harder — Theo shows that changing a single tool description from 'use bash instead' to 'deprecated' flips model behavior, which explains why teams like Cursor get better outcomes by obsessively refining prompts and tool schemas per model.
The Breakdown
Theo starts with the real question: what is a harness?
Theo opens by poking fun at AI buzzwords like “agentic coding” and “vibe coding,” then says “harness” is at least a term that means something specific. He frames the stakes with Matt Mayer’s benchmark: Opus scored 77% in Claude Code and 93% in Cursor, which he says came down to the harness, not the model.
The model can only write text — the harness makes that text do things
His core explanation is simple: an LLM cannot actually touch your computer; it only predicts text. Tool calling is the trick — the model emits special syntax like a bash request, then the harness executes it in code, optionally asks for permission, and feeds the result back into the conversation.
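That parse-execute-append flow can be sketched in a few lines of Python. This is a hypothetical wire format for illustration only; real harnesses use the provider's structured tool-call API, but the shape of the work is the same.

```python
import json
import subprocess

# Hypothetical tool-call syntax emitted by the model (real providers use a
# structured API, not raw JSON in the reply text).
model_reply = '{"tool": "bash", "input": {"command": "echo hello"}}'

# The harness -- ordinary code, no AI involved -- parses and executes it.
request = json.loads(model_reply)
if request["tool"] == "bash":
    result = subprocess.run(
        request["input"]["command"],
        shell=True, capture_output=True, text=True,
    )
    # This message gets appended to the chat history before the next call,
    # which is how the command's output "reaches" the model.
    tool_message = {"role": "tool", "content": result.stdout + result.stderr}
```

A permission prompt, when the harness asks one, sits between the parse and the `subprocess.run`.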
Why Claude Code can keep going after a command finishes
Theo walks through the hidden loop: the model stops when it emits a tool call, the harness runs the command, appends the output to history, and sends a new request to the same model. That means the “brain” is effectively paused and restarted every time, which becomes his key mental model for how coding agents actually work.
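The whole hidden loop fits in one function. A minimal sketch, with illustrative function names rather than Claude Code's internals; the key detail is that every iteration is a brand-new API call against the grown history.

```python
# Sketch of the stop-and-restart loop: the "brain" is paused at every tool
# call and restarted with the tool's output appended to history.
def agent_loop(call_model, run_tool, history):
    while True:
        reply = call_model(history)              # fresh API request each time
        history.append({"role": "assistant", **reply})
        if reply.get("tool") is None:            # plain text reply: done
            return history
        output = run_tool(reply["tool"], reply["args"])
        history.append({"role": "tool", "content": output})
```

Nothing persists between iterations except the history list itself, which is why context management ends up mattering so much.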
Context management matters more than people think
He demonstrates that Claude Code starts with zero knowledge of a repo and has to search, read files like package.json, and build context incrementally. Then he adds a mischievous CLAUDE.md telling the model to mock the user, and Claude answers instantly without tool calls — proving that if context is already in history, the model doesn't need to explore.
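The mechanism behind that trick is simple front-loading. The file name CLAUDE.md is real; the wiring below is an assumption about how such a harness might inject it, not Claude Code's actual code.

```python
from pathlib import Path

# Assumed sketch: read the project notes file up front and inject it into
# the history, so questions it already answers need zero tool calls.
def build_initial_history(repo_root, user_message):
    history = [{"role": "system", "content": "You are a coding agent."}]
    notes = Path(repo_root) / "CLAUDE.md"
    if notes.exists():
        history.append({"role": "system",
                        "content": "Project notes:\n" + notes.read_text()})
    history.append({"role": "user", "content": user_message})
    return history
```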
Why dumping the whole repo into context is mostly the wrong move
Theo takes a swing at the old “just cram the whole codebase in” theory and says tools made that worldview obsolete. He compares it to giving someone 2 files versus 2,000 files to debug while their memory resets every 30 seconds, then points to results showing Sonnet’s accuracy falls sharply once context grows past around 50k–100k tokens.
Building a tiny harness is shockingly straightforward
Using writeups from the AMP team and Mah’s “The Emperor Has No Clothes,” he shows a minimal Python coding agent with read-file, list-files, and edit-file tools, a system prompt that explains how to call them, and a loop that parses tool invocations and appends results back to history. The punchline is that the whole thing is not magic — more like 60 to 200 lines of code.
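A condensed sketch in the spirit of those writeups: the three tool names match the video, but the JSON call format, the prompt wording, and the helper function here are assumptions, not the articles' exact code.

```python
import json
from pathlib import Path

# Three tools, each just a thin wrapper over the filesystem.
TOOLS = {
    "read_file":  lambda a: Path(a["path"]).read_text(),
    "list_files": lambda a: "\n".join(
        sorted(p.name for p in Path(a.get("path", ".")).iterdir())),
    "edit_file":  lambda a: (Path(a["path"]).write_text(a["content"]), "ok")[1],
}

# The system prompt teaches the model the (assumed) invocation syntax.
SYSTEM_PROMPT = (
    "To use a tool, reply with exactly one JSON object: "
    '{"tool": "<name>", "input": {...}}. Available tools: ' + ", ".join(TOOLS)
)

def maybe_run_tool(model_reply):
    """Execute the reply if it is a tool call; return None for plain prose."""
    try:
        req = json.loads(model_reply)
    except json.JSONDecodeError:
        return None
    return TOOLS[req["tool"]](req["input"])
```

Wrap `maybe_run_tool` in the history-appending loop from earlier and you have the whole agent, which is why the line count stays so low.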
Then he strips it down even further: maybe bash is all you need
Theo rewires the demo so the model only has a bash tool, and it still manages to inspect files and edit the project. He’s clearly delighted by this, calling out how wild it is that giving the model one executable interface can turn “advanced autocomplete” into a self-modifying coding tool.
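The bash-only variant is even smaller, because one shell interface subsumes all three tools: `cat` reads, `ls` lists, and output redirection edits. A sketch of the idea, not the demo's exact wiring.

```python
import subprocess

# One tool to rule them all: run a shell command and hand back its output.
def bash_tool(command, timeout=30):
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

# The model can now cover every previous tool through commands alone:
#   bash_tool("cat package.json")        -> read a file
#   bash_tool("ls src/")                 -> list files
#   bash_tool("printf 'x' > note.txt")   -> write/edit a file
```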
Why Cursor wins, and where T3 Code fits
In the final stretch, he answers the practical questions: Cursor’s harness performs better because teams there obsess over tool descriptions, system prompts, and model-specific tuning. He proves how sensitive this is by changing one tool description to “deprecated,” watching behavior flip, then lands the distinction: T3 Code is not a harness at all — it’s a polished UI layer sitting on top of harnesses like Claude Code and Codex.
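The sensitivity he demonstrates comes down to the fact that tool schemas are just more prompt text. These schemas are paraphrased illustrations, not Cursor's or Claude Code's real tool definitions, and the wording change mirrors the video's one-word experiment.

```python
# A tool schema as the model sees it: name, description, parameters.
edit_tool = {
    "name": "edit_file",
    "description": "Edit a file in place. Prefer the bash tool instead.",
    "parameters": {"path": {"type": "string"}, "content": {"type": "string"}},
}

# Changing only the description is enough to flip behavior: a model that
# reads "deprecated" stops calling the tool, even though the code behind
# it is identical. Harness tuning is largely iterating on strings like this.
edit_tool_deprecated = {**edit_tool, "description": "Deprecated. Do not use."}
```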