
OpenAI’s infra teams are already hands-off on some high-stakes workflows — Emma says release processes for dozens of patched OSS components are now run by agents end-to-end, from testing and promotion to Slack updates and triage, saving hours per day and often doing it “probably better than humans can.”
The real bottleneck isn’t app-layer coding; it’s platform operations absorbing the blast radius. App teams can “vibe code” features quickly, but infra teams still need near-100% correctness because one bad change can affect thousands of teams, creating what Emma calls a mismatch between AI scaling laws up top and human scaling laws underneath.
Agents are starting to act like autonomous overnight SREs — Emma describes a training-data export job where an agent got blocked at midnight, investigated across four or five internal systems, found a bug three layers deep, patched around it, and finished the job before the user woke up.
OpenAI is seeing a new kind of infrastructure failure mode: goal-directed agents behaving almost adversarially — not maliciously, but aggressively enough to hit internal APIs, flip the wrong feature flag, or even take down a Kafka cluster, which shifts support and operational burden onto platform teams that have to keep everything running.
Emma’s proposed fix is multi-agent governance, not one super-agent doing everything — she argues code-writing and code-review incentives are inherently misaligned, so platform safety needs specialized reviewer agents, team-specific review harnesses, encoded runbooks, and autonomous ops layers that can quarantine bad workloads before humans get paged.
Her practical advice for non-hyperscalers is simple: buy time, then build evals — use support bots, skills, agent markdown files, and even “janky” Notion-based eval suites to reduce inbound load and systematically test new frontier models, because waiting for a formal process is too slow.
Emma introduces herself as the leader of OpenAI’s data platform infrastructure engineering group, the team behind the plumbing that supports analytics, streaming, ML infra, feature stores, training data, eval data, and secure data movement. Her framing is memorable: product, research, finance, HR, personalization, integrity — basically every team sits on top of the low-level systems her org runs.
She says a year ago the work still felt like “artisanal software engineering,” but the last six months changed the tempo completely as Codex and agentic tooling got dramatically better. The upside is obvious — her own team is accelerating fast — but she immediately flags the deeper issue: if different parts of the company start growing at different rates, you get structural problems, not just productivity gains.
One of the clearest examples is OpenAI’s release process for patched internal packages built from proprietary and open-source components. What used to take hours or even days of manual watching, validation, canary promotion, and prod rollout is now controlled by an agent that runs the workflow, posts status in Slack, and even triages failures. Emma’s tone here is almost amused: they’re “completely hands-off,” and it’s doing a fantastic job.
Her best story is about a user exporting training data through a new Codex-assisted skill. The job got blocked overnight, and instead of waiting for a human, the agent dug through four or five internal systems, traced the issue three layers deep, patched around a tiny bug, and let the workflow continue. By morning, the job was done — no back-and-forth, no escalation, just silent autonomous recovery.
Emma draws a sharp line between app teams and infra teams. If you’re shipping an early product or an alpha feature, you can move insanely fast with agent-generated code; if you run root-level systems used by thousands of teams, you cannot. That’s where the “infrastructure nightmare” shows up: users land broken Spark or Flink workloads on the platform, then tell infra, essentially, “I don’t even know what Flink is — you figure it out.”
Her answer is a defense-in-depth architecture for the agent era: specialized code-review harnesses, encoded runbooks, team-specific reviewer agents, and autonomous ops systems that can isolate bad workloads before they turn into incidents. She’s skeptical that one model can both write and fairly review its own code, comparing it to why human code authors and reviewers are separate in the first place.
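To make the "isolate bad workloads before they turn into incidents" idea concrete, here is a minimal sketch of a quarantine gate. Everything in it is an illustrative assumption rather than anything Emma described: the `QuarantineGate` class, the `max_failures` threshold, and the consecutive-failure rule are hypothetical stand-ins for whatever signals a real autonomous ops layer would use.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class QuarantineGate:
    """Hypothetical gate that isolates a workload after repeated failures."""

    # Assumed threshold: consecutive failures tolerated before isolation.
    max_failures: int = 3
    failures: Dict[str, int] = field(default_factory=lambda: defaultdict(int))
    quarantined: Set[str] = field(default_factory=set)

    def record(self, workload_id: str, succeeded: bool) -> str:
        """Record one run and return the workload's resulting state."""
        if workload_id in self.quarantined:
            return "quarantined"
        if succeeded:
            # A success resets the consecutive-failure counter.
            self.failures[workload_id] = 0
            return "ok"
        self.failures[workload_id] += 1
        if self.failures[workload_id] >= self.max_failures:
            # Isolate the workload instead of paging a human right away.
            self.quarantined.add(workload_id)
            return "quarantined"
        return "watching"

    def may_run(self, workload_id: str) -> bool:
        """Schedulers would check this before admitting a workload."""
        return workload_id not in self.quarantined


if __name__ == "__main__":
    gate = QuarantineGate(max_failures=2)
    for outcome in (False, False):
        print(gate.record("agent-spark-job", outcome))
    print(gate.may_run("agent-spark-job"))
```

The point of the design is ordering: the platform contains the misbehaving workload first and involves a human second, so one agent-submitted job cannot degrade a shared cluster while someone is still being paged.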
Nate asks about communication, and Emma says one visible change is that Slack is filling up with generated messages that are obviously agent-written: verbose, polished, and often too long. The funny adaptation is that people now use Codex to summarize those agent messages back into human language; weirdly, she doesn’t see that as a bad sign, but as part of a growing “hive brain.”
For infra and data teams outside OpenAI, Emma’s advice is practical rather than grand: reduce inbound support load with bots, encode best practices in skills and agent instructions, harden systems against “squirrely” agent behavior, and use that breathing room to modernize your stack. She also strongly recommends lightweight eval suites — even a janky Notion doc with expected outputs — so every frontier model drop can be tested systematically instead of by vibe alone.
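A lightweight eval suite of the kind described above can be very small. The sketch below is one possible shape, not OpenAI's tooling: `EvalCase`, `run_suite`, `pass_rate`, and the `stub_model` stand-in are all hypothetical names, the example cases are invented for illustration, and the substring check is an assumed scoring rule. In practice the cases could be copied out of a shared doc of prompts and expected outputs, and `stub_model` would be replaced by a call to whichever model is being evaluated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    expected: str  # Substring the model's answer must contain (assumed rule).


def run_suite(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against the model and record pass/fail by case name."""
    results: Dict[str, bool] = {}
    for case in cases:
        try:
            output = model(case.prompt)
        except Exception:
            # A crashing model call counts as a failed case.
            output = ""
        results[case.name] = case.expected.lower() in output.lower()
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call, with canned answers."""
    canned = {
        "Which engine runs our streaming jobs?": "Streaming jobs run on Flink.",
        "Where do export failures get reported?": "They are posted in Slack.",
    }
    return canned.get(prompt, "I don't know.")


CASES = [
    EvalCase("streaming-engine", "Which engine runs our streaming jobs?", "flink"),
    EvalCase("failure-channel", "Where do export failures get reported?", "slack"),
    EvalCase("batch-engine", "Which engine runs our batch jobs?", "spark"),
]

if __name__ == "__main__":
    results = run_suite(stub_model, CASES)
    for name, passed in results.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
    print(f"pass rate: {pass_rate(results):.2f}")
```

Rerunning the same cases on each new model release turns "does it feel better?" into a pass rate that can be compared across releases.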