AI EngineerJune 29, 202614m

Your Agent Failed in Prod. Good Luck Reproducing It. - Tisha Chawla & Susheem Koul, Microsoft

TL;DR

Temperature zero is not truly deterministic: Reddit and Hacker News threads show that running the same prompt 1000 times at temperature zero can still produce dozens of different responses due to GPU non-determinism and mixture-of-experts routing.
The real culprit is batch variance: Requests get grouped with whatever else hits the server that millisecond, and if a batch overflows a subnetwork, tokens get rerouted differently.
Stop chasing bitwise determinism, chase replayability: You need to capture what the agent actually did, not force it to repeat the same tokens.
Record at boundaries, not the network layer: Half your agent never touches the network, so capture what enters and exits each node instead of packets.
Recorded traces become free test cases: Chronicle lets you stub LLM nodes while running tool calls live, letting you verify fixes without spending on model calls.
Keep generation-time variation alive: Pinning temperature to zero kills creativity; embrace randomness while building robust observability.

The Breakdown

When an AI agent sells $190,000 worth of stock instead of $1,000, you reach for the logs and replay the prompt locally only to find it works perfectly every time. Tisha Chawla and Susheem Koul from Microsoft explain why temperature zero does not guarantee determinism and introduce Chronicle, a tool that captures agent runs at method boundaries so you can replay failures without calling the model.