Your Agent Failed in Prod. Good Luck Reproducing It. - Tisha Chawla & Susheem Koul, Microsoft
TL;DR
Temperature zero is not truly deterministic: Reddit and Hacker News threads show that running the same prompt 1000 times at temperature zero can still produce dozens of different responses due to GPU non-determinism and mixture-of-experts routing.
The real culprit is batch variance: Requests get grouped with whatever else hits the server that millisecond, and if a batch overflows a subnetwork, tokens get rerouted differently.
Stop chasing bitwise determinism, chase replayability: You need to capture what the agent actually did, not force it to repeat the same tokens.
Record at boundaries, not the network layer: Half your agent never touches the network, so capture what enters and exits each node instead of packets.
Recorded traces become free test cases: Chronicle lets you stub LLM nodes while running tool calls live, letting you verify fixes without spending on model calls.
Keep generation-time variation alive: Pinning temperature to zero kills creativity; embrace randomness while building robust observability.
The Breakdown
When an AI agent sells $190,000 worth of stock instead of $1,000, you reach for the logs and replay the prompt locally only to find it works perfectly every time. Tisha Chawla and Susheem Koul from Microsoft explain why temperature zero does not guarantee determinism and introduce Chronicle, a tool that captures agent runs at method boundaries so you can replay failures without calling the model.
Was This Useful?
Share
Keep Reading
Make Alcreon Yours
Tune your feedFive quick questions, and the feed ranks what matters to you first.Or just get notified
The weekly Echo. Signal worth keeping in your inbox.
Every new piece, announced on X.
Read Next
See all
Playbook
The Cheapest Model That Passes
OpenRouter lists 400 models behind one API. The fix for choosing isn't a better leaderboard, it's a four-step protocol that ends in a real eval.

Playbook
Cheap Models, Hard Tasks
Most agent workflows route every step to the frontier model by default. The bill scales with how chatty the agent gets, even when most steps don't need that brain.

Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.