Deterministic replay: debugging agents that will not reproduce.

An agent processing a batch of refund requests does something catastrophic at 2 a.m.: on one ticket it approves a refund roughly a hundred times the order value. The on-call engineer sees it in the morning. The trace shows what the agent did — the tool calls, the outputs, the final decision — but not why that decision instead of the right one. So the engineer does the obvious thing: re-runs the agent on the same ticket to watch it fail.

It does not fail. It handles the ticket correctly, ten times running. The bug that cost real money is now a ghost — it happened, it is on the books, and it cannot be made to happen again. You cannot fix what you cannot reproduce, so the incident closes as “could not reproduce,” which means it is still live and will fire again on some future ticket nobody can predict.

This is the defining debugging problem of agents, and no dashboard closes it. It is structural: an agent run is non-deterministic, so the central assumption of debugging — that you can run the failing case again and observe it — does not hold. Deterministic replay restores that assumption. This post is about what it records, how it replays, and what the harness around it looks like.

Why an agent run will not reproduce

Re-running an agent on the “same” input does not re-run the same execution, because the input is not the only thing that determines the run. At least five sources of non-determinism sit inside a single agent loop:

Source	Why it varies between runs
Model sampling	Token generation samples from a distribution, so any temperature above 0 yields different completions. Even at temperature 0, variation creeps in through batching, hardware differences, and model-version drift.
Tool & API responses	A CRM, a search API, or a database can each return a different result for the same call: live data moved, pagination shifted, a transient error fired.
Retrieval results	A vector index returns the top-k most similar chunks; once the index is updated or re-embedded, or when the search is approximate, it can return a different set than it did yesterday.
Timing	A clock read, a date, a “current price,” a timeout that fires at one latency and not another — all feed the run and change between executions.
Concurrency	When steps or tool calls run in parallel, the order results arrive varies, and a model conditioned on A-then-B can decide differently than on B-then-A.

The refund bug lived in here somewhere — a CRM that returned a malformed value once, a sampled token that sent the reasoning down an unusual path, a retrieval that pulled a stale policy chunk on that one call. The engineer re-running the agent gets fresh draws from all five sources and lands on a different execution — almost certainly a correct one, because the bug needed a specific, unlucky combination the re-run does not recreate.

Why this makes agents uniquely hard to debug

Ordinary software has non-determinism too — concurrency, network, clocks — and the field handled it. What makes agents harder is that non-determinism is not at the edges of the system; it is the engine in the middle of it.

In a normal service the core logic is deterministic: given the same inputs, the same code path runs. Non-determinism enters at known boundaries — the network call, the thread schedule — and you mock those boundaries and pin the logic. A bug reproduces reliably once you control the edges.

An agent inverts that. The decision-making core — the model — is itself a sampling process. There is no deterministic core to pin; the part you most need to debug is the non-deterministic part. A long run is a chain of sampled decisions, each conditioned on the sampled outputs before it. A bug is rarely “this line is wrong.” It is “this sequence of draws, against these tool responses, at this timing, produced a bad decision” — a path through a combinatorial space that re-running explores randomly and almost never revisits.

The agent-observability essay argues that an agent fails silently, every infrastructure metric green, and that you need step-level tracing to see it fail. That is true and necessary, and it is not this. Tracing tells you, after the fact, what a run did — the sequence of steps, intents, and results. It is a flight recorder; it does not let you re-run the flight. When the trace shows a bad decision and the question is “why that decision,” you need to put the agent back in the exact state it was in and step through it. Tracing observes; replay reproduces. A serious agent needs both, and they are different machinery.

Deterministic replay is borrowed, not new

Agents did not invent non-deterministic execution, and the fix is not novel either. Systems engineering has a mature answer, and two established techniques carry directly over.

Record/replay debugging. Record-and-replay debuggers have existed for decades for exactly the agent problem: a bug that will not reproduce. A 2024 ACM Queue article on deterministic record-and-replay describes the mechanism — during recording, the system stores the non-deterministic inputs passed to the process (for example, the bytes read from the network) into a log; during replay, it re-executes the process using the logged values to recreate the execution state. The insight that transfers cleanly: you record only the non-deterministic inputs, not the whole execution — feed those back and the deterministic parts recompute identically. Tools like rr apply this to ordinary Linux processes, turning a heisenbug into a reproducible one.
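A minimal sketch of that idea, not taken from rr or the Queue article: the names InputLog, capture, and process are invented for illustration. Only the two non-deterministic inputs (a random draw and a clock read) are logged; the arithmetic around them is recomputed on replay and lands on the same result.

```python
import random
import time


class InputLog:
    """Records non-deterministic inputs when recording; serves them back when replaying."""

    def __init__(self, recorded=None):
        self.replaying = recorded is not None
        self.entries = list(recorded) if recorded is not None else []
        self._cursor = 0

    def capture(self, produce):
        """Return a non-deterministic value: fresh when recording, logged when replaying."""
        if self.replaying:
            value = self.entries[self._cursor]
            self._cursor += 1
            return value
        value = produce()
        self.entries.append(value)
        return value


def process(log):
    """Deterministic logic fed by two non-deterministic inputs."""
    noise = log.capture(random.random)
    now = log.capture(time.time)
    return round(noise * 100) + int(now) % 7


recording = InputLog()
original = process(recording)

# Replay: nothing is drawn fresh, so the result is identical.
replayed = process(InputLog(recorded=recording.entries))
```

The log holds two values, not a trace of the whole execution, which is why recordings stay small relative to the work they reproduce.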

Event sourcing. A complementary idea from application architecture. In event sourcing, system state is not stored directly; every change is recorded as an event in an append-only log, and current state is derived by replaying that log from the start. Martin Fowler’s test for event sourcing is the relevant line: at any time you can discard application state and confidently rebuild it from the log. That gives replay a clean shape — model the run as an ordered event log, and it becomes something you can rehydrate, branch from, and step through.
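The rebuild-from-the-log property can be sketched in a few lines. The account-balance domain and the Event, apply, and rebuild names are illustrative assumptions, not drawn from Fowler's text; the point is only that state is a function of the log, and a prefix of the log gives the state at an earlier moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposit" or "withdraw" (hypothetical event types)
    amount: int


def apply(balance, event):
    """Pure transition: the next state depends only on the prior state and the event."""
    if event.kind == "deposit":
        return balance + event.amount
    if event.kind == "withdraw":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def rebuild(log, upto=None):
    """Derive state by replaying the log from the start, optionally only a prefix."""
    balance = 0
    for event in log[:upto]:
        balance = apply(balance, event)
    return balance


log = [Event("deposit", 100), Event("withdraw", 30), Event("deposit", 5)]
current = rebuild(log)                 # state after every event
as_of_second = rebuild(log, upto=2)    # state as it stood after two events
```

Replaying a prefix is what later makes it possible to step through a run or branch from a midpoint.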

An agent run recorder is record/replay debugging with event sourcing’s log structure, pointed at the five non-determinism sources above. The novelty is only in which inputs to capture; the technique is forty years old.

What a run recorder must capture

The recorder’s job is to log every non-deterministic input to the run, so that on replay nothing is drawn fresh. For an agent loop that means capturing, per step, an ordered event log of:

Every model call and its response. The fully assembled prompt sent, and the exact completion returned — every token. On replay this is fed back instead of calling the model. The single most important record: it makes the sampling deterministic without the model having to be.
Every tool and API call and its response. The tool name, exact arguments, and complete result — including empty and error results. On replay the recorded result is returned instead of hitting the live tool. The CRM that returned a malformed value once returns that exact value on every replay.
Every retrieval and its result. The query and the exact set of chunks returned. An index changes underneath you; the recording freezes what this run retrieved.
Seeds and any sampled randomness. Any RNG seed the agent or its libraries used — so the deterministic parts depending on it recompute identically.
Timestamps and clock reads. Every wall-clock value the run observed: the time a step ran, any “current date” or “as of now” the agent read. Replay feeds back the recorded time.
The assembled context per step. The full context window as it stood entering each step. Strictly this is derivable from the events above, but recording it directly makes a replay inspectable step by step — and lets the context-engineering essay’s window discipline be audited after the fact.

Ordering matters as much as contents: concurrency is a non-determinism source, so the log must record the order results actually arrived, and replay must honor it. The recorded run is the event-sourced log of one execution — complete enough that every non-deterministic input has a logged value.
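One possible shape for that log, as a sketch: the field names (seq, step, kind, request, response) and the JSON Lines encoding are assumptions, and it presumes requests and responses are already JSON-serializable. The seq field is assigned at append time, so it records the order in which results actually arrived rather than the order in which calls were issued.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunEvent:
    seq: int          # arrival order across the whole run
    step: int         # agent step that issued the call
    kind: str         # "model", "tool", "retrieval", "seed", or "clock"
    request: dict
    response: object


@dataclass
class RunLog:
    events: list = field(default_factory=list)

    def append(self, step, kind, request, response):
        """Append one event; seq reflects the moment the result arrived."""
        event = RunEvent(len(self.events), step, kind, request, response)
        self.events.append(event)
        return event

    def to_jsonl(self):
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)

    @classmethod
    def from_jsonl(cls, text):
        return cls([RunEvent(**json.loads(line)) for line in text.splitlines()])


log = RunLog()
log.append(0, "clock", {}, 1700000000.5)
log.append(0, "tool", {"name": "crm.get_order", "ticket": "T-1"}, {"total": 20})
log.append(1, "model", {"prompt": "Refund?"}, "approve")

restored = RunLog.from_jsonl(log.to_jsonl())
```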

Replaying a run

With a complete recording, replay is mechanical. Run the agent loop again, but intercept every call to a non-deterministic source and serve the recorded value instead of the live one:

   RECORD (production)                      REPLAY (debugging)
   ┌──────────────────┐                     ┌──────────────────┐
   │   agent loop     │                     │   agent loop     │  ◄ same code
   └────────┬─────────┘                     └────────┬─────────┘
            │ every call out                         │ every call out
   ┌────────▼─────────┐                     ┌────────▼─────────┐
   │  recorder shim   │                     │  replay shim     │
   │  · pass through  │                     │  · DO NOT call   │
   │    to real model │                     │    the real      │
   │    / tool / index│                     │    model/tool    │
   │  · log request   │                     │  · match request │
   │    + response    │                     │    to the log    │
   └────────┬─────────┘                     │  · return the    │
            │                               │    recorded resp │
   ┌────────▼─────────┐                     └────────┬─────────┘
   │  run event log   │ ───────────────────────────► │ (read-only)
   │  models, tools,  │     same log, replayed       │
   │  retrievals,     │                              ▼
   │  seeds, clocks   │                     deterministic re-execution:
   └──────────────────┘                     the bad run, reproduced exactly

Because every non-deterministic input now comes from the log, the run is deterministic: the same completions, tool results, retrievals, and clock — every time. The 2 a.m. refund failure reproduces on a laptop at 2 p.m., identically, as many times as needed. From there it is an ordinary debugging session: step through the recorded execution, inspect the context window entering the bad step, see the exact completion the model produced, find where the reasoning turned. The ghost is now a fixed, inspectable artifact.

Two practical notes. Replay is also a controlled experiment surface: change one input — patch the malformed CRM value, edit a prompt — replay, and see whether the bad decision survives the change. And a replay can diverge from the recording: change the code under test and a step may request something the log has no entry for. A divergence is itself a finding — it pinpoints exactly where new behavior departs from the recorded run.
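Both notes can be sketched with a small replay shim. Everything here is hypothetical: the ReplayShim and Divergence names, the toy refund_agent, the crm.get_order tool, and the recorded values. The shim serves recorded responses in arrival order, lets one response be patched for an experiment, and raises as soon as the code under test asks for something the log does not hold.

```python
class Divergence(Exception):
    """Raised when the replayed code asks for something the log does not contain."""


class ReplayShim:
    def __init__(self, events, patches=None):
        self.events = events            # (kind, request, response) tuples in arrival order
        self.patches = patches or {}    # event index -> substituted response
        self.cursor = 0

    def call(self, kind, request):
        if self.cursor >= len(self.events):
            raise Divergence(f"event {self.cursor}: log exhausted, got {kind} {request!r}")
        rec_kind, rec_request, rec_response = self.events[self.cursor]
        if (rec_kind, rec_request) != (kind, request):
            raise Divergence(
                f"event {self.cursor}: recorded {rec_kind} {rec_request!r}, "
                f"got {kind} {request!r}"
            )
        response = self.patches.get(self.cursor, rec_response)
        self.cursor += 1
        return response


def refund_agent(shim, ticket_id):
    """Toy stand-in for an agent loop: one tool call, then a decision."""
    order = shim.call("tool", {"name": "crm.get_order", "ticket": ticket_id})
    return {"approve": order["total"]}


# Invented recording: the CRM returned a total a hundred times too large.
recorded = [("tool", {"name": "crm.get_order", "ticket": "T-1"}, {"total": 4999.0})]

bad = refund_agent(ReplayShim(recorded), "T-1")                                   # reproduces
fixed = refund_agent(ReplayShim(recorded, patches={0: {"total": 49.99}}), "T-1")  # experiment
```

Replaying with a different ticket id raises Divergence at event 0, which is the finding: it names the first point where the new behavior leaves the recorded run.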

The record/replay harness around the agent loop

The architecture is a thin layer: an interception shim wrapping the agent’s outbound calls, plus an event log, with two modes.

Record mode (production). The shim sits between the agent loop and every external dependency — model client, tool clients, retriever, clock, RNG. Each call passes through to the real dependency; request and response are appended to the run’s event log, tagged with step index and arrival order. Overhead is one log append per call.
Replay mode (debugging / CI). Same shim, same agent code, but calls do not go out — the shim matches each request against the event log and returns the recorded response. The model client is never contacted; no tokens spent; no tool hit.

Two properties make this work. First, the agent code is identical in both modes — record and replay differ only in what the shim does, never in the loop itself, so a replayed run exercises the real code path, not a reconstruction. Second, the shim is the only path to a non-deterministic source — if any code reaches the model, a tool, the clock, or the RNG around the shim, that call cannot be recorded and replay diverges there. The harness is easy to build and easy to undermine; the engineering is the discipline of routing everything through it.
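A sketch of the two-mode shim under stated assumptions: the Shim class, the dependency names, and the lambda stand-ins for a tool and a model are all invented for illustration. The agent_loop function is the same object in both runs, and in replay mode the shim is constructed with no real dependencies at all, so any call that tried to go out would fail loudly.

```python
import time


class Shim:
    """Single interception point; the agent loop never learns which mode it is in."""

    def __init__(self, mode, deps=None, log=None):
        if mode not in ("record", "replay"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.deps = deps or {}     # name -> callable for the real dependency
        self.log = log if log is not None else []
        self.cursor = 0

    def call(self, name, *args):
        if self.mode == "record":
            result = self.deps[name](*args)      # pass through to the real dependency
            self.log.append({"name": name, "args": list(args), "result": result})
            return result
        entry = self.log[self.cursor]            # replay: deps are never touched
        if (entry["name"], entry["args"]) != (name, list(args)):
            raise RuntimeError(f"replay diverged at event {self.cursor}")
        self.cursor += 1
        return entry["result"]


def agent_loop(shim, ticket):
    """Same code in both modes; every outbound call goes through the shim."""
    started = shim.call("clock")
    order = shim.call("tool.lookup", ticket)
    answer = shim.call("model", f"Refund for order {order}?")
    return {"started": started, "answer": answer}


real = {
    "clock": time.time,
    "tool.lookup": lambda ticket: {"ticket": ticket, "total": 20},   # stand-in tool
    "model": lambda prompt: f"approve ({len(prompt)} chars seen)",   # stand-in model
}

recorder = Shim("record", deps=real)
live = agent_loop(recorder, "T-1")
replayed = agent_loop(Shim("replay", log=recorder.log), "T-1")
```

The clock is routed through the shim like any other dependency; a direct time.time() call inside agent_loop would be exactly the kind of bypass that makes a replay diverge.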

It pairs naturally with the budgeting from the agent-budgets essay: the same interception point that records a model call already sees its token count and cost, so recorder and budget meter are the same shim wearing two hats.

Replay earns its keep beyond debugging

A run recorder built for debugging pays for itself more than once, because a recorded run is a reusable asset.

Regression tests harvested from production. A recorded run is a complete, deterministic test case: replay it after a code or prompt change and assert the agent still behaves. The refund failure, once recorded, becomes a permanent regression test — the exact conditions that broke the agent, pinned, so the fix is verified and a re-break is caught in CI. You are not guessing at failure modes with synthetic cases; you are harvesting the real ones.
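As a sketch, a harvested regression test might look like the following. The recording contents, the tool names, and the decide_refund fix (capping the refund at the amount paid) are all invented for illustration; the original incident's actual cause is not known here.

```python
def replay_calls(recorded):
    """Return a `call` function that serves recorded responses in order."""
    remaining = iter(recorded)

    def call(request):
        rec_request, rec_response = next(remaining)
        if rec_request != request:
            raise AssertionError(f"diverged: expected {rec_request!r}, got {request!r}")
        return rec_response

    return call


def decide_refund(call, ticket_id):
    """Hypothetical fixed agent: caps the refund at the order's paid amount."""
    order = call({"tool": "crm.get_order", "ticket": ticket_id})
    requested = call({"tool": "crm.get_refund_request", "ticket": ticket_id})
    return min(requested["amount"], order["paid"])


# Invented incident recording: the refund request carried a malformed amount.
INCIDENT = [
    ({"tool": "crm.get_order", "ticket": "T-1"}, {"paid": 49.99}),
    ({"tool": "crm.get_refund_request", "ticket": "T-1"}, {"amount": 4999.0}),
]


def test_refund_never_exceeds_order_value():
    refund = decide_refund(replay_calls(INCIDENT), "T-1")
    assert refund <= 49.99


test_refund_never_exceeds_order_value()
```

The test asserts on the decision, not on the exact sequence of completions, so it keeps passing across harmless prompt changes and fails only if the over-refund comes back.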

Eval cases from real runs. The agent-observability essay describes a golden set of tasks that gates merges, and the problem that the golden set is finite while production is not. Recorded runs are how it grows from reality: a sampled production run becomes a golden-set case with a known-good or known-bad outcome — real distributions, real tool responses, real edge cases, no invention required.

Audit and forensics. When an agent does something consequential — moves money, sends a message, changes a record — a deterministic recording answers “what exactly happened and why.” It is the role event sourcing plays in financial systems: an irreversible action should leave behind a replayable record of the run that produced it.

Agent frameworks have started to ship pieces of this. LangGraph’s time-travel feature checkpoints graph state at each step and lets you resume from a prior checkpoint — replay from that point unchanged, or fork with modified state to explore an alternative path. That is replay’s experiment surface, built in. It is checkpoint-of-state rather than a full record of every non-deterministic input, so it is a strong start rather than the whole technique — but it shows the direction, and a team on such a framework should use what it offers and record the rest.

The cost and storage tradeoffs

Replay is not free, and the costs are real enough to design around rather than ignore.

Storage. A recording stores every model prompt and completion and every tool result for a run. Recording everything in high-volume production is expensive. The standard answer is sampling — the posture the agent-observability essay takes to trace sampling: record a fixed fraction of runs, plus every run that trips a heuristic — an error, a cost outlier, an anomalous decision — so the runs most worth replaying are the ones most likely recorded. Retention is tiered: sampled-clean runs kept briefly, flagged and incident runs kept long.
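A sketch of that policy, with loud assumptions: the 5% rate, the cost threshold, and the 14-day and 365-day tiers are placeholder numbers, and the function names are invented. It also assumes the run is buffered while it executes and the keep-or-drop decision is made at the end, since flags such as an error or a cost outlier are only known then.

```python
import hashlib


def in_sample(run_id, rate):
    """Deterministically keep roughly `rate` of runs, keyed on the run id."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate


def retention_days(run_id, *, errored, cost_usd, anomalous,
                   rate=0.05, cost_limit_usd=1.00):
    """Return how long to keep a recording, or None to drop it."""
    if errored or anomalous or cost_usd > cost_limit_usd:
        return 365      # flagged or incident tier: keep long
    if in_sample(run_id, rate):
        return 14       # sampled-clean tier: keep briefly
    return None         # unsampled clean run: discard the buffer


flagged = retention_days("run-42", errored=True, cost_usd=0.10, anomalous=False)
```

Hashing the run id rather than calling a random number generator keeps the sampling decision itself reproducible: the same run id always lands in or out of the sample.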

Runtime overhead. In record mode the cost is one log append per external call, which is small against the latency of the call it wraps. The real overhead is the volume of data serialized and stored, not time added to the agent’s critical path.

Sensitive data. A recording captures everything the agent saw — prompts, tool results, retrieved documents — which will include PII and secrets. It is a sensitive data store: encrypt it, access-control it, and put it under the same retention and deletion rules as any other system holding that data. A right-to-be-forgotten request has to be able to reach recordings.

Recording drift. A replay is only guaranteed to reproduce a run exactly against the code version that produced the recording. Replay months-old recordings against today’s agent and divergence is expected: sometimes signal, sometimes noise. Recordings should carry the code and prompt version that produced them, and a replay should report both alongside the versions it is running.
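A small sketch of version stamping, assuming a recording carries a code identifier (for example a commit hash) and a prompt version; the Versions and replay_header names are illustrative. The replay reports whether it is an exact reproduction or a run against drifted code, so a divergence can be read in context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versions:
    code: str      # e.g. a commit hash (assumed identifier)
    prompt: str    # e.g. a prompt template version (assumed identifier)


def replay_header(recorded, current):
    """Describe how far the current agent has drifted from the recording."""
    drift = [
        f"{name}: recorded {getattr(recorded, name)}, replaying {getattr(current, name)}"
        for name in ("code", "prompt")
        if getattr(recorded, name) != getattr(current, name)
    ]
    return {"exact": not drift, "drift": drift}


same = replay_header(Versions("a1b2c3", "v7"), Versions("a1b2c3", "v7"))
moved = replay_header(Versions("a1b2c3", "v7"), Versions("d4e5f6", "v7"))
```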

None of these is a reason not to record. They are reasons to record deliberately — sample, tier retention, secure the store, version the recordings.

The checklist

Before an agent runs anything consequential in production:

Every non-deterministic input — model calls, tool calls, retrievals, seeds, clocks — is captured through a single interception shim.
The shim is the only path to a non-deterministic source; nothing reaches a model, tool, or clock around it.
A recorded run replays deterministically — same completions, same tool results, same retrievals, every time.
The agent code is identical in record and replay mode; replay exercises the real loop.
Replay honors the recorded arrival order of concurrent results.
Production recording is sampled, plus full capture of flagged and incident runs; retention is tiered.
Recordings carry the code and prompt version that produced them, and replay reports it.
Recordings are encrypted, access-controlled, and reachable by data-deletion requests.
Recorded failures are promoted into the regression suite and the eval golden set.

When you can check all nine, an unreproducible failure is a contradiction in terms. Until then, “could not reproduce” is just where your hardest bugs go to wait.

Reading list

The ACM Queue article on deterministic record-and-replay — the systems-engineering technique agents are borrowing, and the key idea that you record inputs, not whole executions.
Martin Fowler on Event Sourcing — the append-only-log architecture that gives a recorded run its shape, and the test for whether you are really doing it.
LangGraph’s time-travel documentation — a framework-level take on replay and forking from checkpointed state; a concrete starting point, if not the whole technique.

A bug you cannot reproduce is not a bug you have fixed — it is a bug you are waiting on. Record the run, replay it exactly, and the ghost becomes an engineering problem with an answer.
