Matthew Berman · 1h 22m

GPT-Realtime-2, Directionally Bad and Agent Memory

TL;DR

  • Anthropic’s new ‘Dreaming’ feature is basically memory consolidation for agents — Richmond Alake says Claude is likely reviewing past sessions, reinforcing useful patterns, and forgetting low-value details so future replies are faster, cheaper, and more personalized.

  • Agent memory is becoming a real product surface, not just a research idea — Richmond points to Anthropic’s Dreaming, OpenAI’s new memory feedback UI, and older work like Letta’s ‘sleep-time compute’ and Stanford’s Generative Agents as evidence the industry is finally operationalizing memory.

  • The big near-term gains may come from harness engineering, not changing model weights — both Matthew Berman and Richmond keep circling the same point: with tools, memory, and better context management around static models, companies could unlock ‘trillions’ in value without true continual learning.

  • Richmond argues most teams want autonomy but actually need automation — in practice, the useful pattern today is LLM-driven workflows with memory and retrieval, not fully unconstrained agents ‘going wild’ with tools.

  • Oracle showed a concrete memory benchmark where engineered memory flattened token growth across 100 turns — compared with a naive append-everything agent, Richmond says Oracle’s agent memory package kept token use stable while an LLM judge still preferred the memory-engineered answers.

  • Memory engineering is becoming its own discipline — Richmond’s framing is that the hard part is no longer just stuffing context into a model, but deciding what to summarize, retrieve, reinforce, and forget using signals like recency, relevance, and importance.

The Breakdown

‘Directionally Very Bad’ and the most relatable OpenAI text ever

Matthew opens on memes from the Sam Altman firing saga, especially the now-famous exchange between Sam and Mira Murati: “Can you indicate directionally good or bad?” followed by “Directionally very bad.” He lingers on how weirdly human it feels — Satya Nadella texting for updates, Sam stuck outside the boardroom, and that deadpan “Okay” response that somehow says everything.

Richmond Alake joins to decode Anthropic’s ‘Dreaming’

After a quick live troubleshooting session with camera and echo — “hardware is hard, AGI will not save us” — Richmond Alake from Oracle jumps in. Asked about Claude’s new Dreaming feature, he says the name is actually perfect: like human sleep, it’s probably about consolidating useful memory, reinforcing patterns, and discarding noise between sessions.

This isn’t new — Letta, sleep-time compute, and Generative Agents were early

Richmond points to Letta’s work and the “sleep-time compute” paper from Sarah and Charles as being roughly a year ahead of the current wave, while Matthew brings up Stanford’s 2023 Generative Agents paper as a foundational moment. They both light up revisiting that experiment: 25 agents in a simulated town forming friendships, making party plans, and showing emergent behavior that felt uncanny at the time.

OpenAI, Anthropic, and the race to make memory usable

The conversation shifts from research to product. Richmond notes that OpenAI’s new memory UI — where users can mark memory sources as relevant or not — looks like a human-feedback layer for training better memory selection, while Anthropic appears to be pushing more automatic memory consolidation behind the scenes. His bigger point: memory isn’t solved, but every major lab now sees it as central to cost, quality, and personalization.

Are we just bolting memory onto models that can’t really learn?

A sharp audience question tees up the philosophical core: if transformers can’t hold persistent state, is Dreaming just a workaround? Richmond agrees the current approach is mostly “bolt-on” memory rather than true continual learning, but says that may be enough for now; Dario Amodei’s view, which they cite, is that in-context learning plus better scaffolding could still create multiple trillions in economic value.

Most people ask for autonomy, but what they really need is automation

When Richmond shares his screen, he zooms out and maps the last few years of AI apps: chatbots, then RAG, then today’s shift toward automation and autonomy. His punchline is memorable: “Most people want autonomy, but what they need is automation” — structured workflows with LLMs making bounded decisions, not agents freewheeling through a task with infinite discretion.
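
One way to picture "bounded decisions" is a workflow where the model may only pick from a fixed set of actions and anything else is routed to a human. The sketch below is an illustration under assumptions, not something shown in the episode: `fake_llm_choose`, the `Action` labels, and the handlers are all hypothetical stand-ins, and a real system would replace the stub with an actual model call.

```python
from enum import Enum


class Action(Enum):
    # The only actions the workflow is allowed to take (hypothetical labels).
    REFUND = "refund"
    REPLY = "reply"
    ESCALATE = "escalate"


def fake_llm_choose(ticket: str) -> str:
    """Hypothetical stand-in for an LLM call that returns an action label."""
    text = ticket.lower()
    if "refund" in text:
        return "refund"
    if "delete my account" in text:
        # Simulates the model proposing something outside the allowed set.
        return "drop_database"
    return "reply"


def decide(ticket: str, choose=fake_llm_choose) -> Action:
    """Accept the model's choice only if it is one of the allowed actions."""
    raw = choose(ticket)
    try:
        return Action(raw)
    except ValueError:
        # Out-of-bounds suggestion: fall back to a human instead of acting.
        return Action.ESCALATE


# Each action maps to ordinary, deterministic code.
HANDLERS = {
    Action.REFUND: lambda t: f"refund issued for: {t}",
    Action.REPLY: lambda t: f"reply drafted for: {t}",
    Action.ESCALATE: lambda t: f"sent to a human: {t}",
}


def run_workflow(ticket: str) -> str:
    return HANDLERS[decide(ticket)](ticket)
```

The design choice is that the model decides which branch to take, while the branches themselves stay fixed and auditable. That is automation with an LLM in the loop, as opposed to an agent with open-ended tool access.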

Context engineering and memory engineering are the real game now

Richmond walks through context windows, compaction, offloading, retrieval, and why developers need to think about signal-to-noise ratio instead of just stuffing more tokens in. He then introduces “memory engineering” — his term for the discipline of deciding what gets stored, summarized, surfaced, reinforced, or forgotten, often using heuristics like recency, relevance, and importance.
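
To make the recency, relevance, and importance heuristics concrete, here is a minimal sketch of scoring stored memories and dropping the weakest ones. It loosely follows the equal-weight sum popularized by the Generative Agents paper; the `Memory` fields, the decay rate, and the `forget_below` threshold are assumptions chosen for illustration, and the relevance and importance values are supplied by hand rather than computed by a model or an embedding search.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    age_hours: float   # hours since the memory was last accessed
    importance: float  # 0..1, assumed to be assigned when the memory is stored
    relevance: float   # 0..1, assumed similarity to the current query


def score(m: Memory, decay: float = 0.99) -> float:
    """Sum of recency, importance, and relevance, each in the 0..1 range."""
    recency = decay ** m.age_hours  # exponential decay with age
    return recency + m.importance + m.relevance


def select(memories, k: int = 2, forget_below: float = 0.5):
    """Return the top-k memories, ignoring any that score under the threshold."""
    ranked = sorted(memories, key=score, reverse=True)
    kept = [m for m in ranked if score(m) >= forget_below]
    return kept[:k]


memories = [
    Memory("user prefers Python", age_hours=1, importance=0.9, relevance=0.8),
    Memory("user said hello", age_hours=500, importance=0.05, relevance=0.05),
    Memory("deadline is Friday", age_hours=24, importance=0.7, relevance=0.3),
]
selected = select(memories)
```

With these hand-picked numbers, the old greeting scores well under the threshold and is left out, while the preference and the deadline are surfaced. A production system would also need to decide when to summarize or permanently delete, which this sketch does not cover.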

Oracle’s benchmark: stable token use, decent latency, and better answers

To ground the theory, Richmond shows Oracle’s agent memory work: in a 100-turn comparison, a naive agent’s token consumption climbs steadily, while the memory-engineered version stays relatively flat. He says an LLM judge still preferred the engineered outputs, and then gets into the next layer of tradeoffs — latency, KV cache, and the tension between richer memory systems and preserving fast inference.
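
The shape of that comparison can be reproduced with a toy simulation. This is not Oracle's package or benchmark: the whitespace token count, the ten-word turns, the four-turn window, and the truncation used in place of a real summarizer are all assumptions made so the example runs on its own.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-split word.
    return len(text.split())


def naive_context(history):
    """Append-everything baseline: the full transcript is sent every turn."""
    return " ".join(history)


def engineered_context(history, window: int = 4, summary_budget: int = 20):
    """Keep the last `window` turns verbatim plus a fixed-size summary of the rest.

    The "summary" here is just the last `summary_budget` words of the older
    turns, a placeholder for an LLM-written summary. Assumes window >= 1.
    """
    older, recent = history[:-window], history[-window:]
    summary_words = " ".join(older).split()[-summary_budget:]
    return " ".join(summary_words + recent)


history = []
naive_tokens, engineered_tokens = [], []
for turn in range(1, 101):
    # Each simulated turn is exactly ten words long.
    history.append(f"turn {turn} " + "word " * 8)
    naive_tokens.append(count_tokens(naive_context(history)))
    engineered_tokens.append(count_tokens(engineered_context(history)))
```

In this toy setup the naive context grows linearly with the number of turns, while the engineered context is capped by the window size plus the summary budget. It says nothing about answer quality, which is what the LLM judge in Richmond's comparison was measuring.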

Final takeaway: 2026 is about the harness

By the end, Richmond’s thesis is clear: the frontier isn’t just better models, it’s the system wrapped around them. He predicts 2026 will be defined by harnesses — memory, tools, retrieval, and self-optimizing scaffolds — while Matthew closes by noting he still needs to test GPT-Realtime-2 voice properly in a future stream.
