Harness Engineering: How to Build Software When Humans Steer, Agents Execute — Ryan Lopopolo, OpenAI
TL;DR
Ryan Lopopolo’s core bet is that “code is free” now — with GPT-5.2-class agents, he argues implementation is no longer the bottleneck, and each engineer can effectively direct 5, 50, or 5,000 “engineers” depending on token budget and GPU access.
Harness engineering is about feeding the right instructions at the right moment — instead of stuffing everything into one prompt, Ryan uses docs, lints, tests, reviewer agents, and CI comments as just-in-time context injection so agents learn what “good” looks like when it matters.
OpenAI’s internal workflow starts with the agent, not the editor — Ryan says his team banned touching editors, gives tickets directly to Codex, and built repo-native skills for launching apps, observability stacks, and Chrome DevTools so the agent is the primary operator.
The biggest gains came from eliminating repeat review feedback, not writing more code — his team set aside “garbage collection day” every Friday to turn recurring PR complaints into durable docs, lint rules, and reviewer agents, steadily reducing “slop” and human review burden.
He spends over a billion output tokens a day because planning, docs, implementation, and CI all consume serious budget — Ryan estimates his usage is roughly split a third each across planning/ticket curation/docs, implementation, and CI/review, with the explicit goal of saturating work 24/7.
His endgame is autonomous product execution over quarters, not autocomplete in the IDE — the future he describes is handing agents a ranked backlog, success metrics, and reliability goals, then letting them continuously ship, test, triage, and improve software with humans steering priorities.
The Breakdown
Code Is Free, So the Job Changes
Ryan opens with a provocation: for the last nine months, he’s built software “exclusively with agents,” even banning his team from touching their editors. His thesis is blunt — implementation is no longer scarce, code is free, and the real job now is deploying abundant agent capacity against real problems.
The Three Scarce Resources: Human Time, Attention, Context
He reframes the bottlenecks as human time, human/model attention, and context window. In the old world, P3s died in the backlog; in the new one, you can fire off four versions in parallel, pick the best, and ship — which is how he talks about internal OpenAI tools getting localization for London, Paris, Brussels, Zurich, and Munich “from day one.”
Guardrails Matter More Than Code
Ryan says the valuable artifact isn’t the code itself but the prompt and guardrails that produced it. That means ADRs, ticket history, persona-oriented docs, and logs of past reviews all become crucial because they encode the non-functional requirements agents otherwise won’t infer correctly from a giant codebase.
How He Teaches Agents Not to Write Slop
One of his most memorable lines: you can just tell agents "do not produce slop," but only if you've written down what slop means in your environment. He describes using reviewer agents for security and reliability, bespoke lints for things like retries and timeouts around network code, tests that enforce source-code structure (such as a 350-line file limit), and error messages that don't just fail but explain the remediation path.
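To make the pattern concrete, here is a minimal sketch (not Ryan's actual harness) of a structure-enforcing test in the spirit he describes: fail any source file over 350 lines, and make the failure message explain the remediation path instead of just reporting the violation. The doc path in the message is a hypothetical placeholder.

```python
from pathlib import Path

MAX_LINES = 350  # the limit cited in the talk; the enforcement code is assumed

def check_file_lengths(root: str) -> list[str]:
    """Return one remediation-oriented message per oversized Python file."""
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        n = len(path.read_text().splitlines())
        if n > MAX_LINES:
            failures.append(
                f"{path}: {n} lines exceeds the {MAX_LINES}-line limit. "
                "Split this module by extracting cohesive helpers into a "
                "sibling file; see docs/structure.md for naming conventions."
            )
    return failures

def test_no_oversized_files():
    # Run as a normal test so agents hit the guidance exactly when they need it.
    msgs = check_file_lengths("src")
    assert msgs == [], "\n".join(msgs)
```

The point of the error text is the second half: the agent that trips the check gets told what "good" looks like at the moment it matters, which is the whole just-in-time-context idea.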
Prompt Injection, Everywhere, on Purpose
Ryan jokes that modern software engineering with agents is mostly finding increasingly niche places to insert prompts. Rules files, skills, lint failures, PR comments, test harnesses, and even agent-written prompt-writing skills all become ways of continually refreshing context as auto-compaction pages old context out.
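One way to picture this "prompt injection on purpose" is a CI step that scans a diff for risky patterns and emits a PR comment refreshing the agent's context with the relevant house rule. Everything below is illustrative: the patterns, the guidance text, and the doc paths are assumptions, not OpenAI's actual rules.

```python
import re

# pattern -> just-in-time context to inject when the pattern shows up in a diff
CONTEXT_RULES = {
    r"requests\.(get|post)\(": (
        "Network calls need explicit timeouts and a retry policy; "
        "see docs/networking.md before merging."
    ),
    r"except\s*:": (
        "Bare except clauses hide failures; catch specific exceptions "
        "per docs/error-handling.md."
    ),
}

def comments_for_diff(diff: str) -> list[str]:
    """Return the context-injecting PR comments triggered by added lines."""
    added = [line[1:] for line in diff.splitlines() if line.startswith("+")]
    hits = []
    for pattern, guidance in CONTEXT_RULES.items():
        if any(re.search(pattern, line) for line in added):
            hits.append(guidance)
    return hits
```

Because the comment lands in the PR thread, it re-enters the agent's context on the next pass even after auto-compaction has paged out whatever the agent originally read.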
The Real Workflow: Tickets Into Codex, Humans Out of the Loop
In the Q&A, he gets concrete: the entry point is Codex, not a human-crafted shell. His team hands the agent a ticket plus a small set of skills, and the repo is designed so Codex can launch the app, bring up observability, connect DevTools, and work through local tooling without humans babysitting every environment change.
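A hypothetical sketch of what a repo-native "skill" might look like: one named entry point the agent can invoke to bring up the app or its observability stack, instead of a human hand-running each command. Every command, service name, and the registry shape here is an assumption for illustration, not OpenAI's actual tooling.

```python
import subprocess

# skill name -> the command sequence the agent gets by asking for the skill
SKILLS = {
    "launch-app": [["make", "run"]],
    "observability": [
        ["docker", "compose", "up", "-d", "grafana", "prometheus"],
    ],
}

def run_skill(name: str, dry_run: bool = False) -> list[list[str]]:
    """Execute a skill's command sequence; with dry_run, just return it."""
    commands = SKILLS[name]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

The design choice is that the repo, not the human, owns the environment knowledge: the agent never needs to be told how to launch anything, only which skill to ask for.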
Garbage Collection Day and the End of Human Review as Bottleneck
When each engineer started generating three to five PRs a day, human review became the blocker and merge conflicts got ugly fast. Their answer was “garbage collection day” every Friday: catalog the repeat feedback, turn it into docs, reviewer agents, or lint rules, and make that whole category of review friction disappear.
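The triage step of a garbage collection day can be sketched as a simple tally: count recurring review comments so the most repeated feedback gets turned into a doc, lint rule, or reviewer agent first. The naive lowercase-and-strip normalization is an assumption; real comment data would come from the review system.

```python
from collections import Counter

def gc_day_targets(
    comments: list[str], min_repeats: int = 3
) -> list[tuple[str, int]]:
    """Return (comment, count) pairs repeated often enough to automate away."""
    counts = Counter(c.strip().lower() for c in comments)
    # most_common sorts by frequency, so the worst offenders come first
    return [(c, n) for c, n in counts.most_common() if n >= min_repeats]
```

Anything that clears the threshold is, by definition, feedback a human has now given several times, which is exactly the category Ryan's team tries to make disappear.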
The Endgame: A Quarter of Work, One Token Budget
Ryan’s closing vision is bigger than coding copilots: he wants to hand over a quarter, a half-year, or a year of ranked work plus success metrics and let agents keep shipping without him clicking “continue.” He even describes tethering his laptop in the back seat during his commute so tasks can keep running — a funny image, but also the whole philosophy in miniature: if the agent needs you to poke it, the harness failed.