AI Engineer | May 10, 2026 | 16m

Two Roads to Durable Agents: Replay vs. Snapshot — Eric Allam, Trigger.dev

TL;DR

Replay logs break down for real agents — Eric Allam argues the classic durable execution model works for workflows like process order, but an agent’s endlessly growing loop of LLM calls and tool calls eventually hits replay limits in size, entries, or versioning complexity.
Agents are sessions, not transactions — his core framing is that workflows have a start and end, while agents persist as long as the user wants, which makes the old stateless backend pattern feel mismatched for multi-hour or multi-day work.
Durable agents need two different persistence layers — Allam splits the problem into an append-only context log for prompts, tool calls, and outputs, plus snapshot/restore for execution state like cloned repos, installed packages, subprocesses, and in-memory data.
Snapshotting compute is what preserves the ‘machine’ side of an agent — instead of reconstructing execution from logs, Trigger.dev snapshots the machine when the user ‘goes to lunch’ and restores it later, making long pauses cheap without losing files, memory, or running processes.
Trigger.dev moved from CRIU to Firecracker microVMs — after shipping CRIU-based snapshots in 2024 and doing millions of restores, they hit limitations around subprocesses, files, and container registry slowness, then switched to full-machine Firecracker snapshots.
Compression made whole-machine snapshots practical — a naive 512 MB VM snapshot was too expensive, but with seekable compression and layering, Trigger.dev got snapshots down to about 14 MB compressed, with snapshot times under a second, restores in a few hundred milliseconds, and roughly 15,000 VM starts per minute.

Summary

Why the usual agent demo falls apart in production

Eric Allam opens with the familiar agent loop — good enough on your laptop, not good enough for production. The real requirements are heavier: long-running work, durability across turns and code versions, and recovery from failures. He frames the whole talk as a backend infrastructure shift, with a joke about a meme where you can’t tell which one is the human and which one is the agent.

The 30-year rule of backends: stateless compute wins

He does a quick history sprint from CGI in 1993 to PHP, LAMP, Rails, Node, and serverless. The common pattern is “request + DB = response,” aka shared-nothing architecture, where meaningful state lives in the database and compute stays disposable. That model dominated because every new request could just redo the work from stored state.

Workflows solved retries, but by forcing everything into replay

As apps got more complex, side effects turned into multi-step async tasks: send email, charge card, resize image, process order. The failure mode is obvious and painful: if the “send receipt” step fails, you really do not want to rerun the whole thing and charge the credit card twice. Durable workflow engines fixed that by caching each step’s result and replaying execution, giving you audit trails and resume points, but also forcing code into a rigid, deterministic shape.
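
The step-caching idea can be sketched in a few lines of Python. This is a minimal illustration, not any particular engine's API: the `Journal` class, the step names, and the in-memory dictionary are all assumptions made for the example, and a real engine would persist the journal durably.

```python
from typing import Any, Callable, Dict, List


class Journal:
    """Hypothetical in-memory journal; a real engine would persist it."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # On replay, a completed step returns its cached result
        # instead of running its side effect a second time.
        if name in self.results:
            return self.results[name]
        result = fn()
        self.results[name] = result
        return result


charges: List[str] = []          # stands in for the payment provider
attempts = {"receipt": 0}        # counts calls to the flaky email step


def charge_card() -> str:
    charges.append("order-1")
    return "charge-ok"


def send_receipt() -> str:
    attempts["receipt"] += 1
    if attempts["receipt"] == 1:
        raise RuntimeError("mail server down")  # first attempt fails
    return "receipt-sent"


def process_order(journal: Journal) -> str:
    journal.step("charge_card", charge_card)
    return journal.step("send_receipt", send_receipt)


def run_with_retry(journal: Journal, max_attempts: int = 3) -> str:
    # Each retry re-executes the whole function; the journal makes
    # already-completed steps cheap and side-effect free.
    for _ in range(max_attempts):
        try:
            return process_order(journal)
        except RuntimeError:
            continue
    raise RuntimeError("gave up")


journal = Journal()
outcome = run_with_retry(journal)
```

The second run of `process_order` re-enters the function from the top, but the card is charged only once because `charge_card` is served from the journal. This is also where the determinism requirement comes from: the code must reach the same steps in the same order on every replay.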

LLMs started as workflow steps, then flipped the control flow

At first, LLMs fit neatly into that world as just another step — classify some text and move on. But once tool calling got good, the orchestration inverted: now the LLM is orchestrating the code. Allam says that if you make an agent loop durable with replay, every LLM call and every tool call becomes another journal entry, and the log just keeps growing turn after turn until the system starts to fall over.
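
A rough back-of-the-envelope model shows why the journal becomes a problem. The numbers and both function names below are illustrative assumptions, not figures from the talk: the sketch assumes one journal entry per LLM call and per tool call, and that each new turn resumes by replaying everything recorded so far.

```python
def journal_entries(turns: int, tool_calls_per_turn: int) -> int:
    # One entry for the LLM call plus one per tool call, every turn.
    return turns * (1 + tool_calls_per_turn)


def total_replayed(turns: int, tool_calls_per_turn: int) -> int:
    # If turn t starts by replaying the journal written by turns 0..t-1,
    # the replay work summed over the session grows quadratically.
    return sum(journal_entries(t, tool_calls_per_turn) for t in range(turns))


entries_after_100 = journal_entries(100, 3)
replayed_over_100 = total_replayed(100, 3)
```

Under these assumptions a 100-turn session with three tool calls per turn holds 400 journal entries, and the cumulative replay work across the session is 19,800 entry reads. A workflow with a fixed number of steps never sees this growth, while an open-ended session does.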

The key distinction: context state versus execution state

His cleanest idea is splitting the agent into two halves. One is the append-only context log — system messages, user messages, tool calls, tool results, assistant responses — which is extremely valuable and easy to make durable with databases, object storage, or distributed filesystems. The other is execution state: the messy machine stuff like cloned GitHub repos, installed packages, in-memory datasets, subprocesses, and dev servers.
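
The context half is the easy one to sketch. The example below assumes a local JSON Lines file as the durable store and a hypothetical `ContextLog` class; the talk mentions databases, object storage, or distributed filesystems, and any of those could sit behind the same two operations.

```python
import json
import os
import tempfile
from typing import Dict, List


class ContextLog:
    """Hypothetical append-only log: entries are added, never rewritten."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, role: str, content: str) -> None:
        entry = {"role": role, "content": content}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the entry to disk before returning

    def load(self) -> List[Dict[str, str]]:
        if not os.path.exists(self.path):
            return []
        with open(self.path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "session.jsonl")
log = ContextLog(log_path)
log.append("system", "You are a coding agent.")
log.append("user", "Clone the repo and run the tests.")
log.append("tool_call", "run_tests")
log.append("tool_result", "12 passed")
log.append("assistant", "All tests pass.")

# After a crash or a deploy, a fresh process rebuilds the context
# from the file alone.
recovered = ContextLog(log_path).load()
```

Nothing in this log can bring back a cloned repository or a running dev server, which is why execution state needs a separate mechanism.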

Why replay can’t preserve the machine, but snapshots can

That execution state, he says, cannot realistically be rebuilt from a log. If the user disappears for a while, you can’t afford to keep the machine live, so the answer is snapshot and restore: freeze the machine, save it, bring it back when the next message arrives. That gives you cheap durability across turns, while the context log separately gives you recovery across crashes and code upgrades.
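
The suspend-on-idle, restore-on-message lifecycle can be sketched as a small state machine. Everything here is a stand-in: `FakeMachine` models a machine as a dictionary of files plus a list in memory, JSON serialization stands in for a real machine snapshot, and the `Session` class and its timeout are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeMachine:
    """Toy stand-in for execution state (files on disk, data in memory)."""
    files: Dict[str, str] = field(default_factory=dict)
    memory: List[int] = field(default_factory=list)


class Session:
    def __init__(self, idle_timeout_s: float) -> None:
        self.idle_timeout_s = idle_timeout_s
        self.machine: Optional[FakeMachine] = FakeMachine()
        self.snapshot: Optional[str] = None
        self.last_message_at = 0.0

    def tick(self, now: float) -> None:
        # When the user has been idle long enough, save the machine
        # and release it so no compute is held while waiting.
        idle = now - self.last_message_at
        if self.machine is not None and idle >= self.idle_timeout_s:
            self.snapshot = json.dumps(asdict(self.machine))
            self.machine = None

    def on_message(self, now: float) -> FakeMachine:
        # The next message restores the machine from the snapshot.
        if self.machine is None:
            if self.snapshot is None:
                raise RuntimeError("no snapshot to restore from")
            self.machine = FakeMachine(**json.loads(self.snapshot))
        self.last_message_at = now
        return self.machine


session = Session(idle_timeout_s=60.0)
machine = session.on_message(now=0.0)
machine.files["repo/README.md"] = "cloned"
machine.memory.extend([1, 2, 3])

session.tick(now=3600.0)              # the user went to lunch
suspended = session.machine is None

restored = session.on_message(now=3601.0)
```

The restored machine has the same files and in-memory data as before the pause, without any log having been replayed to rebuild them.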

From IBM checkpoints to CRIU hacks to Firecracker

Allam makes the point that this idea is old: IBM mainframes had checkpoint/restore back in 1966 because expensive jobs couldn’t be rerun from scratch. Trigger.dev first used CRIU, a Linux process snapshot system he describes as injecting a “parasite” into a process to dump its memory, and says they shipped it in 2024 and ran millions of snapshot restores. But it only really captured a process, struggled with things like Chrome and FFmpeg, depended on open files, and got slow once container registries entered the picture.

The Firecracker pivot and the open-source tool coming next

Their answer was moving to Firecracker microVMs, which let them snapshot the entire machine, not just a process. A naive 512 MB snapshot was too costly, so they used seekable compression and layering to shrink snapshots to roughly 14 MB compressed, with snapshots taking a bit under a second and restores only a few hundred milliseconds. He ends by introducing FC Run, a Docker-like CLI for running, snapshotting, restoring, and even forking Firecracker VMs, with benchmarks around 15,000 VM starts per minute — the infrastructure he thinks points toward a stateful-compute future for agents.
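
For scale, going from a 512 MB image to roughly 14 MB is a reduction of about 37x. The talk summary does not describe the compression format, so the following is only a sketch of the general idea behind seekable compression: compress fixed-size chunks independently and keep an index, so that a reader can decompress just the chunks it needs. The chunk size, function names, and the use of `zlib` are all assumptions for illustration.

```python
import zlib
from typing import List, Tuple

CHUNK_SIZE = 64 * 1024  # assumed chunk size, chosen for the example

Index = List[Tuple[int, int]]  # (offset in blob, compressed length) per chunk


def compress_seekable(data: bytes, chunk_size: int = CHUNK_SIZE) -> Tuple[bytes, Index]:
    blob = bytearray()
    index: Index = []
    for start in range(0, len(data), chunk_size):
        frame = zlib.compress(data[start:start + chunk_size])
        index.append((len(blob), len(frame)))
        blob += frame
    return bytes(blob), index


def read_range(blob: bytes, index: Index, offset: int, length: int,
               chunk_size: int = CHUNK_SIZE) -> bytes:
    # Decompress only the chunks that overlap the requested byte range.
    if length <= 0:
        return b""
    first = offset // chunk_size
    last = min((offset + length - 1) // chunk_size, len(index) - 1)
    out = bytearray()
    for i in range(first, last + 1):
        pos, size = index[i]
        out += zlib.decompress(blob[pos:pos + size])
    skip = offset - first * chunk_size
    return bytes(out[skip:skip + length])


# A mostly-zero 4 MiB buffer stands in for a sparse memory image.
buf = bytearray(4 * 1024 * 1024)
buf[1_000_000:1_000_005] = b"hello"
image = bytes(buf)

blob, index = compress_seekable(image)
piece = read_range(blob, index, 1_000_000, 5)
```

Because each chunk is independent, a restore can fetch and decompress pages on demand instead of inflating the whole image first, which is one plausible reason seekability matters for fast restores.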
