You're absolutely right! · 49m

No hype: how real engineers are using AI in production today

TL;DR

  • Real engineers are still all-in on AI coding, but not autonomously — Wes says he codes with LLMs “100% of the time” and pegs the gain at roughly 4–5x, while both hosts describe today’s sweet spot as fast assisted coding plus human review, not hands-off agents.

  • The winning workflow is getting thinner, not more elaborate — instead of piling on skills, agents.md files, and orchestration hacks, they’re betting on better base models like Codex 5.3/5.4 and building just enough validation infrastructure so the model can prove its work through APIs, UI snapshots, and CLI checks like Bruno.

  • AI’s biggest bottleneck isn’t writing code anymore — it’s trust and validation — the hosts keep coming back to the same problem: legacy codebases, databases, caches, event systems, and cross-service workflows are hard to reproduce in a sandbox, which makes “fully autonomous” agents risky in production.

  • More code does not automatically mean more business value — they quote Primeagen’s line that 2026 could be “the year of productivity masturbation,” arguing that AI can flood teams with shipped tasks and experiments while still reducing the deep thinking, taste, and product coherence that create real value.

  • Harnesses matter more than people realize — their explanation of an AI “harness” is basically all the machinery around the model: tools, memory, permissions, loops, context management, retrieval, and execution environments; Wes points to evals where Opus scored around 70% in one setup versus 95% in Cursor, underscoring how much the wrapper changes outcomes.

  • Agent products like OpenClaw and Hermes feel more hyped than indispensable — after struggling to make OpenClaw work and getting somewhat better results from Hermes, Wes says the best real use so far is lightweight personal automation like daily SEO checks with Google Search Console and PostHog, not anything transformative for production teams.

The Breakdown

The state of AI coding: still useful, less magical

They open by separating internet hype from what engineers are actually doing day to day. Wes is blunt: he still uses AI for basically all coding and sees a real 4–5x boost, but not the “10x or 100x” mythology people throw around online. The vibe is less breathless than a few months ago — not because it stopped working, but because everyone is numb and overloaded.

Thin harness, strong models, real validation

Wes says the biggest shift in his workflow is going all-in on Codex for backend and logic work, while still switching to Claude Opus 4.6 for UI because “Codex is crappy for UIs.” Instead of collecting tools and skills, he’s trying to build an environment where the agent can validate itself — calling APIs, rendering UIs, taking snapshots, and catching things like currency values being sent in dollars instead of cents. That, to him, is where the leverage is now.
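The dollars-versus-cents bug is a good example of the class of check such a validation environment can run automatically. Here is a minimal sketch of one such validator; the payload shape, field names, and threshold are assumptions for illustration, not anything specified in the episode:

```python
def validate_charge_payload(payload: dict) -> list[str]:
    """Return a list of problems with a payment payload; empty means it passed.

    Catches the classic bug Wes mentions: a $20.00 charge sent as `20`
    (dollars) instead of `2000` (integer cents).
    """
    problems = []
    amount = payload.get("amount")
    if not isinstance(amount, int):
        # Floats like 19.99 almost always mean dollars leaked through.
        problems.append("amount must be an integer number of cents")
    elif 0 < amount < 50:
        # A charge under 50 cents is suspicious; likely dollars-as-cents.
        problems.append("amount suspiciously small: dollars instead of cents?")
    return problems
```

An agent that runs a check like this against its own API calls can prove a change is safe instead of asking the reviewer to eyeball every payload.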

Why they don’t love persistent skills and markdown instruction files

Both hosts are skeptical of stuffing codebases with skills, agents.md files, and permanent prompt scaffolding. Their argument is practical: the agent can already inspect files and run commands, and all those extra instructions eventually rot, turn into noise, and can even make newer models worse by freezing old assumptions in place. Wes says he talks to the model like he’d talk to a teammate, not like he’s summoning a “really smart staff engineer.”

Shipping faster vs creating actual value

This is the philosophical center of the episode. Scott wonders whether hand-coding forced deeper thought, while AI lets teams blast through tasks so fast that they may lose the reflection that leads to better product decisions; he quotes Primeagen calling 2026 “the year of productivity masturbation.” Wes lands in the middle: yes, speed helps experimentation, but if you use AI to throw 20 fragmented ideas at customers instead of building one coherent product with taste and vision, you may just create more cleanup work.

Claude Managed Agents and the real meaning of a “harness”

Scott walks through Anthropic’s Claude Managed Agents as a prebuilt cloud setup for long-running, asynchronous jobs: isolated containers with file systems, tool use, internet access, loops, and recovery built in. That leads into a clear explanation of “harness” as everything around the raw model call — tool execution, memory, permissions, context handling, agent loops, and infrastructure. They stress that this wrapper often matters as much as the model itself, especially when context management is the whole game.
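Stripped to its core, the harness they describe is just a loop around the model call. The sketch below is a toy version under obvious assumptions (a stand-in `model` callable and a plain dict of tools, rather than any real LLM API); it only exists to show where tool execution, the agent loop, and the turn budget live relative to the model:

```python
import json

def run_harness(model, tools, user_msg, max_turns=5):
    """Minimal agent harness: loop the model over tool calls until it answers.

    `model` is any callable that takes the message list and returns either
    {"tool": name, "args": {...}} or {"answer": value}. Real harnesses add
    memory, permissions, retrieval, and context management around this loop.
    """
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):                 # the agent loop
        reply = model(messages)
        if "answer" in reply:                  # model says it is done
            return reply["answer"]
        result = tools[reply["tool"]](**reply["args"])   # tool execution
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None  # hit the turn budget without a final answer
```

Everything the hosts list as "harness" lives in or around this loop, which is why the same model can score so differently in different wrappers.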

Context is the actual AI engineering job now

From PDF search to customer account context, they argue the hard part isn’t calling an LLM — it’s deciding what information to send, when, and in what form. Scott praises systems like AMP for being good at truncating history, summarizing conversations, and retrieving the right chunks instead of just flooding the model with everything. His framing is sharp: AI engineering has become “managing context and managing the harness.”
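The truncation half of that job can be sketched in a few lines. This is a deliberately crude stand-in (character counts instead of tokens, no summarization or retrieval) meant only to show the shape of the decision AMP-style systems make on every call:

```python
def trim_context(messages, budget_chars=8000, keep_first=1):
    """Crude context manager: always keep the first `keep_first` messages
    (e.g. the system prompt), then fill the remaining budget with the most
    recent history, silently dropping the middle. Character count stands in
    for a real tokenizer; production systems also summarize what they drop.
    """
    head = messages[:keep_first]
    used = sum(len(m["content"]) for m in head)
    tail = []
    for m in reversed(messages[keep_first:]):  # newest first
        if used + len(m["content"]) > budget_chars:
            break
        tail.append(m)
        used += len(m["content"])
    return head + tail[::-1]                   # restore chronological order
```

Deciding what goes in `head`, what gets summarized, and what gets dropped is exactly the "managing context" work Scott is pointing at.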

Why autonomous agents still hit a wall in real codebases

When they bring it back to production software, the tone gets cautious again. Wes says the dream of autonomous agents breaks on reality: legacy repos weren’t built for agents, and real trust requires spinning up databases, Valkey/Redis-style caches, AWS EventBridge flows, and full reproducible environments, not just isolated unit tests. They mention Vercel’s “Agent Responsibly” post as a warning that agents can optimize for green CI while still failing to fix the real bug.

OpenClaw, Hermes, and the gap between agent demos and daily life

Near the end, Wes talks through trying OpenClaw, failing to get it stable, then moving to Hermes on a Beelink box because it was easier to configure and showed tool calls more transparently. The most useful thing so far is modest: daily SEO and traffic checks using app links, Google Search Console, and PostHog, with some automatic PRs for metadata changes. Their conclusion is pretty grounded — maybe a few people have killer assistant workflows, but for most users these products still feel like more hype than durable value.
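The comparison step of a daily check like that is simple enough to sketch. Everything here is an assumption for illustration: the metric names, the 30% threshold, and the idea that some separate code has already pulled per-page numbers from Search Console or PostHog into plain dicts.

```python
def traffic_alerts(today: dict, baseline: dict, drop_pct: float = 30.0):
    """Flag pages whose traffic dropped by at least `drop_pct` vs a baseline.

    `today` and `baseline` map page paths to a count (e.g. clicks or views);
    fetching those numbers from Search Console / PostHog is out of scope here.
    """
    alerts = []
    for page, base in baseline.items():
        now = today.get(page, 0)  # a page missing today counts as zero
        if base and (base - now) / base * 100 >= drop_pct:
            alerts.append(f"{page}: {base} -> {now} clicks")
    return alerts
```

Wiring this into a morning cron job with a chat notification is roughly the level of "agent" automation Wes found genuinely useful, which says a lot about where the value actually is today.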