AI Engineer · 18m

Fast Models Need Slow Developers — Sarah Chieng, Cerebras

TL;DR

  • Fast models change the bottleneck from waiting to supervision — Sarah Chieng says Codex Spark runs at 1,200 tokens/sec versus roughly 40-60 for Sonnet or Opus, which means sloppy prompting now produces bad code 20x faster unless developers slow down and actively steer.

  • The speed jump is coming from the whole inference stack, not one trick: she points to hardware changes like Cerebras's on-chip SRAM, disaggregated prefill/decode, MoE architectures, pruning methods like REAP, and inference-layer KV cache reuse from providers like Together, Modal, Fireworks, and Baseten.

  • 'Agent swarm' setups look impressive but often just manufacture technical debt — her critique of six terminals, 500-agent swarms, and eight-agent five-screen rigs is that nobody is verifying the output, and faster inference makes that problem much worse.

  • Use smart models for planning and fast models for execution — her suggested workflow is a larger model like GPT-5.3/5.4 for long-horizon planning, then a fast executor like Codex Spark for subagents, repeated tasks, and replaying successful 'skills' from prior sessions.

  • Validation becomes basically free at 1,200 tokens/sec — instead of saving tests, linting, pre-commit hooks, diff reviews, browser QA, and cleanup for the end, she argues these checks should run at every step because they no longer meaningfully slow you down.

  • Context management gets more urgent as models speed up — if a context window used to fill in 10 minutes, a 20x faster model can hit compaction in 30 seconds, so she recommends breaking work into bounded tasks and externalizing state into files like agents.md, plan.md, progress.md, and verify.md.

The Breakdown

The core warning: fast inference will amplify your worst habits

Sarah Chieng opens with a blunt premise: developers picked up bad habits from slow code generation — giant one-shot prompts, massive commits, and too many agents running at once. With Codex Spark generating code at 1,200 tokens per second versus roughly 40-60 for models like Sonnet or Opus, those habits no longer waste minutes; they generate technical debt at industrial speed.
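The size of that jump is simple arithmetic. A minimal sketch using only the throughput figures quoted in the talk (1,200 tokens/sec versus roughly 40-60); the 2,000-token diff size is an illustrative assumption, not a number from the talk:

```python
def speedup(fast_tps: float, slow_tps: float) -> float:
    """How many times faster the fast model emits tokens."""
    return fast_tps / slow_tps


def seconds_to_generate(tokens: int, tps: float) -> float:
    """Wall-clock seconds to emit `tokens` at `tps` tokens per second."""
    return tokens / tps


FAST_TPS = 1200            # figure quoted for Codex Spark
SLOW_TPS_RANGE = (40, 60)  # rough range quoted for Sonnet / Opus
DIFF_TOKENS = 2000         # assumption: size of one generated change

low = speedup(FAST_TPS, SLOW_TPS_RANGE[1])   # against the 60 tok/s end
high = speedup(FAST_TPS, SLOW_TPS_RANGE[0])  # against the 40 tok/s end

print(f"speedup: {low:.0f}x to {high:.0f}x")
print(f"fast model: {seconds_to_generate(DIFF_TOKENS, FAST_TPS):.1f}s per change")
print(f"slow model: {seconds_to_generate(DIFF_TOKENS, SLOW_TPS_RANGE[1]):.1f}s per change")
```

The "20x" figure used throughout corresponds to the 60 tokens/sec end of the quoted range; against 40 tokens/sec the ratio is 30x.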

Why models are suddenly getting so much faster

She zooms out to explain that this is a stack-wide shift, not a single model trick. On hardware, she frames the "memory wall" as the villain, noting that 50-80% of inference latency comes from memory movement. That is why Nvidia's GPUs, which rely on off-chip HBM, look so different from Cerebras's wafer design, which distributes SRAM across the chip so each core can directly access what it needs.

The commercial arrival of disaggregated inference

One of her biggest technical points is that prefill and decode are finally being split across different hardware. Prefill is parallel and compute-bound, decode is sequential and memory-bound, so running both on the same machine is increasingly wasteful; she ties that trend to examples like Nvidia buying Groq for $20 billion and Cerebras partnering with AWS to combine its wafer with Trainium.
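The compute-bound versus memory-bound split can be illustrated with a back-of-the-envelope arithmetic-intensity estimate. This is a simplification and not something shown in the talk: it assumes a single request, a dense model whose weights are read once per forward pass, about 2 FLOPs per parameter per token, and it ignores attention and KV-cache traffic. The prompt length is hypothetical.

```python
def flops_per_weight_byte(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Each parameter contributes about 2 FLOPs (multiply + add) per token, while
    the weights are read once per pass no matter how many tokens share it.
    bytes_per_param=2.0 assumes 16-bit weights.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


PROMPT_TOKENS = 4096  # hypothetical prompt length

# Prefill: the whole prompt goes through the model in one parallel pass.
prefill_intensity = flops_per_weight_byte(tokens_per_pass=PROMPT_TOKENS)

# Decode: each pass produces exactly one new token.
decode_intensity = flops_per_weight_byte(tokens_per_pass=1)

print(f"prefill: {prefill_intensity:.0f} FLOPs per weight byte")
print(f"decode:  {decode_intensity:.0f} FLOPs per weight byte")
```

Under these assumptions prefill does thousands of times more math per byte moved than decode, which is the intuition behind giving each phase hardware tuned for its own bottleneck.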

Models and inference software are also being redesigned for speed

At the model layer, she calls out mixture-of-experts as the canonical example: activate only a subset of the model for each token and you get the intelligence of a much larger system at the compute cost of a smaller one. She also mentions pruning techniques like REAP and software-layer optimizations like KV cache reuse from infrastructure players such as Together, Baseten, Modal, and Fireworks.
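The "activate only part of the model" idea can be sketched as top-k expert routing. This is a toy illustration, not the design of any model named in the talk: the experts are scalar functions, the router logits are hard-coded rather than learned, and softmax over the selected experts is just one common weighting choice.

```python
import math
from typing import Callable


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, experts: list[Callable[[float], float]],
                router_logits: list[float], k: int) -> float:
    """Run only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy experts; only two of them run for this input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

weights = top_k_route(logits, k=2)
output = moe_forward(10.0, experts, logits, k=2)
print(weights, output)
```

With k=2 of 4 experts, half of the expert parameters are touched for this token, which is where the compute saving comes from.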

Her takedown of the '10 agents on 5 screens' aesthetic

Sarah has fun with the internet's current flex culture: six Claude Code terminals, 500-agent swarms, eight agents across five screens. Her point is not that parallelism is fake, but that social-media setups often hide the real issue — nobody is checking the code, and with faster models that becomes downright dangerous.

The practical playbook: planner models, executor models, and captured skills

Her first workflow recommendation is orchestration by strength. Use a more capable model like GPT-5.3 or 5.4 for planning and long-horizon reasoning, then hand the actual checklist to fast executors like Codex Spark; if a session goes especially well, capture it as a reusable skill so a small, fast model can replay a verified trajectory over and over.
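A minimal sketch of that planner/executor split, assuming the models are wrapped as plain callables. Every name here (`run_session`, `replay`, `Skill`, and the fake planner, executor, and verifier) is hypothetical and stands in for real model calls; nothing below is an actual Codex or GPT API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Skill:
    """A verified sequence of steps that a fast model can replay later."""
    name: str
    steps: list[str] = field(default_factory=list)


def run_session(goal: str,
                planner: Callable[[str], list[str]],
                executor: Callable[[str], str],
                verify: Callable[[str], bool]) -> Optional[Skill]:
    """Plan once with the capable model, execute each step with the fast one."""
    steps = planner(goal)            # one call to the slow, capable model
    for step in steps:
        result = executor(step)      # one call per step to the fast model
        if not verify(result):
            return None              # never capture a trajectory that failed
    return Skill(name=goal, steps=steps)


def replay(skill: Skill, executor: Callable[[str], str]) -> list[str]:
    """Re-run a captured skill without involving the planner again."""
    return [executor(step) for step in skill.steps]


# Stand-ins for real model calls (assumptions for illustration only).
def fake_planner(goal: str) -> list[str]:
    return [f"{goal}: step {i}" for i in (1, 2, 3)]


def fake_executor(step: str) -> str:
    return f"done({step})"


def fake_verify(result: str) -> bool:
    return result.startswith("done(")


skill = run_session("add navbar", fake_planner, fake_executor, fake_verify)
replayed = replay(skill, fake_executor) if skill else []
print(replayed)
```

The design choice worth noting is that a skill is only captured after every step passes verification, so replays start from a trajectory that is known to have worked once.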

Validation, cherry-picking, and real-time pair programming

This is where speed becomes liberating instead of reckless. At 1,200 tokens per second, she says validation is "basically free," so tests, linting, pre-commit hooks, diff reviews, browser QA, and automatic refactors should run continuously, not just at the end. She also loves using fast models to generate 15 navbar variations (or 75 via subagents) so the human can apply their own taste and cherry-pick the best one.
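Running checks at every step rather than at the end can be sketched as a gated loop. This is an illustrative structure only: the step and check callables are hypothetical placeholders, and in practice each check would wrap whatever test runner, linter, or hook the project already uses.

```python
from typing import Callable

Check = Callable[[], bool]


def run_with_gates(steps: list[Callable[[], None]],
                   checks: dict[str, Check]) -> list[str]:
    """Apply steps one at a time and stop at the first step that breaks a check."""
    log = []
    for i, step in enumerate(steps, start=1):
        step()
        failed = [name for name, check in checks.items() if not check()]
        if failed:
            log.append(f"step {i}: FAILED {', '.join(failed)}")
            break  # fix the problem now instead of piling more changes on top
        log.append(f"step {i}: ok")
    return log


# Toy state standing in for a working tree (assumption for illustration).
state = {"lines": 0}


def add_line() -> None:
    state["lines"] += 1


checks = {
    "tests": lambda: True,               # placeholder: always passes
    "size": lambda: state["lines"] <= 2,  # placeholder quality gate
}

log = run_with_gates([add_line, add_line, add_line], checks)
print(log)
```

Because the loop stops at the first failing step, the bad change is isolated to one small increment instead of being discovered after the whole task is finished.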

Slow down, steer harder, and manage context like it matters

Her strongest behavioral advice is to stop treating AI coding like "spawn a session, get a hamburger, scroll Twitter, come back." Instead, she wants developers acting like real-time pair programmers: constrain the model, ban file deletion, cap diff size, say things like "only change this" or "don't touch types yet," and stay in the driver's seat because "the AI should always be helping you make decisions, not the other way around."
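Two of those constraints, banning file deletion and capping diff size, can be enforced mechanically. The sketch below is one possible guard, not a tool mentioned in the talk: it scans git-style unified diff text, treats a `+++ /dev/null` header as a deleted file, and uses a hypothetical `max_changed_lines` cap. It is deliberately simple and would miscount a content line that itself begins with `++` or `--`.

```python
def check_diff(diff_text: str, max_changed_lines: int = 200) -> list[str]:
    """Return a list of rule violations found in a unified diff."""
    violations = []
    changed = 0
    deleted_files = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ /dev/null"):
            deleted_files += 1          # new side is /dev/null: file removed
        elif line.startswith(("+++", "---")):
            continue                    # file headers, not content changes
        elif line.startswith(("+", "-")):
            changed += 1                # an added or removed content line
    if deleted_files:
        violations.append(f"deletes {deleted_files} file(s)")
    if changed > max_changed_lines:
        violations.append(
            f"{changed} changed lines exceeds cap of {max_changed_lines}")
    return violations


SAMPLE = """\
diff --git a/old.py b/old.py
deleted file mode 100644
--- a/old.py
+++ /dev/null
@@ -1,2 +0,0 @@
-print("a")
-print("b")
diff --git a/app.py b/app.py
--- a/app.py
+++ b/app.py
@@ -1 +1 @@
-x = 1
+x = 2
"""

strict = check_diff(SAMPLE, max_changed_lines=3)
lenient = check_diff(SAMPLE)
print(strict)
print(lenient)
```

A guard like this could run before a change is accepted, so the instruction "don't delete files" is checked rather than merely requested in the prompt.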

The context window will now betray you faster

She closes on context management with a neat bit of math: if a model used to take 10 minutes to hit compaction, a 20x faster one gets there in 30 seconds. Her answer is external memory and bounded tasks — keep agents.md for roles, plan.md for the checklist, progress.md for state, and verify.md for quality gates — so each new session can pick up cleanly without dragging a bloated context window behind it.
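A minimal sketch of the compaction math and of externalizing state into files. The file names plan.md and progress.md come from the talk; the `- [ ]` checklist format, the helper names, and the temporary directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional


def seconds_to_compaction(baseline_seconds: float, speedup: float) -> float:
    """Time to fill the context window once generation is `speedup` times faster."""
    return baseline_seconds / speedup


def next_task(plan: Path) -> Optional[str]:
    """Return the first unchecked item in plan.md, or None if all are done."""
    for line in plan.read_text().splitlines():
        if line.startswith("- [ ] "):
            return line[len("- [ ] "):]
    return None


def complete_task(plan: Path, progress: Path, task: str) -> None:
    """Tick the task off in plan.md and append a note to progress.md."""
    text = plan.read_text().replace(f"- [ ] {task}", f"- [x] {task}", 1)
    plan.write_text(text)
    with progress.open("a") as f:
        f.write(f"done: {task}\n")


# 10 minutes at the old speed becomes 30 seconds at 20x.
print(seconds_to_compaction(10 * 60, 20))

workdir = Path(tempfile.mkdtemp())  # stand-in for the repository root
plan = workdir / "plan.md"
progress = workdir / "progress.md"
plan.write_text("- [ ] add navbar\n- [ ] write tests\n")

first = next_task(plan)             # a fresh session reads state from disk
complete_task(plan, progress, first)
second = next_task(plan)            # the next session resumes where it left off
print(first, second)
```

Because the checklist and the progress log live on disk, a new session needs only these small files to resume, not the previous session's full context.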
