Code Mode - Sunil Pai, Cloudflare
TL;DR
Tool calling breaks at real scale, so Cloudflare started having models write JavaScript instead — Sunil Pai says stuffing hundreds of tools into context gets slow and brittle, while generated code can loop, hold state, parallelize, and hit many APIs in one execution.
Cloudflare compressed a 2,600-endpoint API into two tools: search and execute — Matt Carey’s setup lets the model inspect the full OpenAPI spec and then run code against it, cutting an impossible 1.2–1.5 million token prompt down to about 1,000 tokens.
The point isn’t just efficiency — it’s a new interface model where the LLM ‘inhabits the state machine’ — Sunil’s big example is Kenton Varda’s drawing app, where the model stopped trying to generate a separate tic-tac-toe program and instead read raw stroke data, recognized the board, and played directly on the canvas.
Safe execution becomes the core architecture problem — Sunil argues the real product is a harness: a fast sandbox with no default powers, explicit capability grants, blocked outgoing fetches by default, and full observability so you can trace why an agent did something expensive or dangerous.
This could blow up traditional UI assumptions — instead of forcing every user through the same generic workflow, Sunil imagines per-user generated interfaces for tasks like returns, delayed orders, or shopping under a budget, all assembled on the fly from backend capabilities.
Companies need to start designing for agents as first-class users — his closing message is that ‘your next billion users are these little robots,’ so docs, errors, search, and typed APIs should be built for systems that ‘dream in types and syntax errors.’
The Breakdown
Why Sunil Pai thinks tool calling falls apart
Sunil opens with a familiar pain point for anyone building AI apps: tool calling feels fine with a handful of tools, then turns ugly once you cram in Google services, Jira, wikis, and “hundreds and hundreds of tools.” His alternative is simple but consequential: stop doing endless JSON back-and-forth, and ask the model to generate code — usually JavaScript — that runs against an environment in one shot.
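A minimal sketch of the difference, under assumptions: the `api` object and its two endpoints are invented stand-ins for whatever the environment exposes. Where classic tool calling would need one round trip per call, a single generated program can loop, hold intermediate state, and fan out requests in parallel:

```javascript
// Invented stand-in for the environment exposed to generated code:
// a tiny API surface with two async endpoints.
const api = {
  listZones: async () => ["zone-a", "zone-b", "zone-c"],
  getDnsRecords: async (zone) => [`${zone}:A`, `${zone}:CNAME`],
};

// One model-written program instead of many JSON tool-call round trips:
// it loops over results, parallelizes the fan-out, and aggregates state.
async function generatedProgram() {
  const zones = await api.listZones();          // one call for the list
  const perZone = await Promise.all(            // fan out in parallel
    zones.map((zone) => api.getDnsRecords(zone))
  );
  return perZone.flat();                        // hold and merge state locally
}

generatedProgram().then((records) => console.log(records.length)); // 6
```

The same work done as discrete tool calls would cost a model inference per step; here the plan and the execution collapse into one emitted program.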
The Cloudflare API test: 2,600 endpoints, two tools
The first real proving ground was Cloudflare’s own API surface, which has around 2,600 endpoints. Sunil says exposing each one as a tool would cost roughly 1.2 to 1.5 million tokens up front, so Matt Carey built a system with just two tools, search and execute, both of which take code as input. The result: the model can inspect the whole OpenAPI spec, generate code, and act — dropping the prompt size to around 1,000 tokens, which Sunil calls basically a 99.9% reduction.
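The talk names the two tools but not their implementation, so the following is a hedged sketch: a toy two-endpoint spec stands in for Cloudflare's ~2,600, `search` is plain keyword matching, and `execute` uses `new Function` where a real harness would use a proper sandbox:

```javascript
// Toy stand-in for Cloudflare's OpenAPI spec (~2,600 endpoints in reality).
const openApiSpec = {
  "/zones": { method: "GET", summary: "List zones" },
  "/workers/scripts": { method: "GET", summary: "List Workers scripts" },
};

// Tool 1: search the spec, so the model pulls only the endpoints it
// needs into context (~1,000 tokens instead of ~1.2-1.5 million).
function search(query) {
  return Object.entries(openApiSpec)
    .filter(([path, op]) =>
      (path + " " + op.summary).toLowerCase().includes(query.toLowerCase()))
    .map(([path, op]) => ({ path, ...op }));
}

// Tool 2: run model-generated code against an API client.
// `new Function` is only illustrative; a real harness would sandbox this.
function execute(code, apiClient) {
  return new Function("api", code)(apiClient);
}

console.log(search("workers")[0].path);                             // "/workers/scripts"
console.log(execute("return api.ping()", { ping: () => "pong" })); // "pong"
```

The design choice worth noticing: neither tool encodes any specific endpoint, so the tool schema stays constant no matter how large the API surface grows.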
The live demo that half-broke and still made the point
He tries a live demo asking the system to “list my workers,” with read-only access and a lot of stage anxiety. The model searches for relevant endpoints, emits code, and starts running it, only to hit JavaScript errors and pagination weirdness — classic on-stage demo karma. Sunil leans into it, joking that it worked in rehearsal and that he probably needs to pay for a better Mythos model, but the larger point survives: the system can turn a broad API surface into a single executable program instead of eight slow round trips.
From coding assistant to “inhabiting the state machine”
Then the talk gets intentionally “a little woo-woo.” Sunil contrasts how a programmer would solve “rename 200 photos on your desktop” — open an IDE, script it, maybe call a vision model — with how nontechnical users are usually stuck with some janky $7/month app and a daemon that’s “stealing your crypto.” His thesis is that LLMs collapse that divide because everyone now has access to something that can write and run code against exposed systems.
Kenton’s tic-tac-toe story changed how they think
The most memorable moment is Kenton Varda’s sketch app experiment. Kenton drew a tic-tac-toe board and told the model to play; at first it started generating a whole new tic-tac-toe app, and Kenton stopped it: no, read the existing stroke array and use that state. The model recognized the board from raw strokes, saw the X in the top left, and drew an O in the center — no game logic existed anywhere — which led Sunil to say it had stopped generating a program and started “inhabiting the state machine.”
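A hypothetical reconstruction of what "inhabiting the state machine" means in code — the cell numbering and stroke shape are invented for illustration; the point is that the agent reads the app's existing stroke array and appends to it, rather than generating a separate game program:

```javascript
// Invented representation of the sketch app's existing state:
// strokes already on the canvas, tagged with a 3x3 grid cell (0..8).
const strokes = [
  { cell: 0, glyph: "X" }, // the X drawn in the top-left
];

// Play by reading and mutating the existing state, not by spinning up
// new game logic elsewhere.
function playO(strokeArray) {
  const occupied = new Set(strokeArray.map((s) => s.cell));
  const center = 4;
  const move = occupied.has(center)
    ? [0, 1, 2, 3, 5, 6, 7, 8].find((c) => !occupied.has(c))
    : center;
  strokeArray.push({ cell: move, glyph: "O" }); // "draw" onto the shared canvas
  return move;
}

console.log(playO(strokes)); // 4 — an O in the center
```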
The harness: safe eval, explicit capabilities, total observability
From there, Sunil frames the real architecture as a harness: a sandbox that starts with zero capabilities and only gets explicit APIs you grant it. He emphasizes capability-based security, fast startup, no outgoing fetches by default, and full observability — you need to know why, say, “last Tuesday it made a trade for $2.3 million for llama poop.” The implementation could be V8 isolates, WASM, or something else; the important thing is fast, embeddable, constrained execution.
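A minimal sketch of the harness shape, with assumed names throughout (`Harness`, `grant`, the audit-log fields): zero capabilities by default, every power an explicit logged grant, and `fetch` shadowed so generated code cannot reach the network unless granted:

```javascript
// Sketch of a capability-based harness: nothing is reachable until
// it is explicitly granted, and every call is recorded.
class Harness {
  constructor() {
    this.capabilities = {}; // starts with zero powers
    this.auditLog = [];     // observability: why did the agent do X?
  }
  grant(name, fn) {
    // Each capability is a named, wrapped function that logs its use.
    this.capabilities[name] = (...args) => {
      this.auditLog.push({ name, args, at: Date.now() });
      return fn(...args);
    };
  }
  run(code) {
    // Shadow `fetch` as a parameter so outgoing requests are blocked
    // by default inside the generated code.
    const blockedFetch = () => { throw new Error("outgoing fetch blocked"); };
    return new Function("caps", "fetch", code)(this.capabilities, blockedFetch);
  }
}

const harness = new Harness();
harness.grant("listWorkers", () => ["worker-a", "worker-b"]);
console.log(harness.run("return caps.listWorkers();")); // ["worker-a", "worker-b"]
```

In production this role would fall to something like V8 isolates or WASM, as the talk notes; `new Function` here only illustrates the capability-passing shape, and the audit log is what lets you answer the "$2.3 million llama poop" question after the fact.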
Generated UI, local mashups, and building for robot users
In the final stretch, Sunil pushes the idea beyond one-off code execution into long-running workflows and generative UI. He imagines e-commerce experiences that generate different interfaces per user — returning shoes under $100 versus checking a delayed order — instead of one bland UI trying to satisfy everyone. He closes with the line that sticks: your customers are still humans, but “your next billion users are these little robots,” so product teams need docs, errors, search, and APIs built for agents that “hang out in registries” and “dream in types and syntax errors.”