# Shipping an agent: canaries and rollback for prompts

Two changes land in a production agent on the same afternoon. The first edits a retry-backoff constant in the orchestration code; it goes through a pull request, gets two reviews, runs the test suite, and ships behind a staged rollout. The second reorders a clause in the system prompt and swaps one adjective for a near-synonym; it is a one-line edit to a text file, it ships to all traffic the moment it is saved, and no one reviews it, because it is "just a prompt." Both changes alter what the production system does. The team treated only one of them as a deploy.

That asymmetry is the bug — not a bad prompt, the _belief_ that a prompt is not a deploy. The belief rests on an analogy to code: small diff, small blast radius; a one-character change is a one-character risk. The analogy holds for code because a programming language has fixed semantics, so the size of a diff bounds the size of a behavior change. It does not hold for prompts. A prompt is read by a model whose response to wording is steep, non-linear, and not legible from the diff — a synonym swap can be inert or it can move behavior by dozens of points, and the text of the change does not tell you which. And there is a second case the code analogy misses entirely: the model under your prompt can change with no diff at all, because someone else updated it. Both are deploys — changes that alter production behavior — and both deserve what a code deploy gets: a staged rollout that catches a regression on a sliver of traffic, and a rollback that is a single action. This essay is about treating them that way.

## A prompt change is a deploy

Start with the claim that a small wording change is a small behavior change, because it is wrong by a measured, large margin. The FormatSpread work ([arXiv 2310.11324](https://arxiv.org/abs/2310.11324)) quantified how sensitive models are to "spurious" prompt features — formatting choices that carry no semantic content at all: the separator between fields, the casing of a label, whether a space follows a colon. These are not even word choices; they are the typographic packaging around the words. On Llama-2-13B in few-shot settings, varying _only_ those cosmetic features produced accuracy differences of up to 76 points.

Hold the size of that number against the size of the change that produced it. Seventy-six points is the gap between a system that works and one that is worse than guessing — and the input that moved it was a separator character, the kind of edit that does not survive into a commit message because it does not feel like a change at all. If inert-looking formatting can do that, a clause reorder or a synonym swap — edits that at least touch meaning — are not safely assumed small either.

The natural hope is that this is a small-model artifact that scale sands away. FormatSpread closes that exit explicitly: the sensitivity did not disappear with larger models, did not wash out with more few-shot examples, and did not resolve under instruction tuning. It is a property of how models read prompts, not a defect of one weight class. The implication for shipping is flat and load-bearing: there is no such thing as a cosmetic prompt change. An edit that looks like punctuation can carry a behavior change of a magnitude no one would dream of shipping in code without review and a rollout. The size of the diff is simply not evidence about the size of the deploy — and once that link is broken, every prompt edit has to be treated as potentially consequential, because nothing cheap distinguishes the ones that are.
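To make "spurious features" concrete, here is a minimal Python sketch that renders the same fields under a handful of cosmetic formats and reports the gap between the best and worst score. It is not the paper's search procedure. The separator and casing lists and the names `render_variants` and `spread` are illustrative assumptions, and the scores would come from your own eval run.

```python
from itertools import product

# Illustrative cosmetic choices only: none of these changes what the prompt says.
SEPARATORS = [": ", ":", " - ", "\n"]
CASINGS = [str.lower, str.upper, str.title]


def render_variants(fields: dict[str, str]) -> list[str]:
    """Render the same label/value fields under every separator x casing combination."""
    variants = []
    for separator, casing in product(SEPARATORS, CASINGS):
        lines = [f"{casing(label)}{separator}{value}" for label, value in fields.items()]
        variants.append("\n".join(lines))
    return variants


def spread(scores: list[float]) -> float:
    """Gap between the best and worst score across formatting variants."""
    if not scores:
        raise ValueError("need at least one score")
    return max(scores) - min(scores)
```

Scoring each variant against the same eval set and calling `spread` on the results gives a rough local measurement of how much of your accuracy is riding on typography.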

## A model swap is a deploy you did not schedule

The second case is worse than the first, because you do not initiate it and may not even observe it. The model under your prompt is, increasingly, a hosted model the provider updates on their own schedule. When that model changes, your application's behavior changes — and no diff lands in your repository, no line turns red in a review, no pipeline runs. The deploy happened; it just happened to you rather than by you.

The study "How is ChatGPT's behavior changing over time?" ([arXiv 2307.09009](https://arxiv.org/abs/2307.09009)) documented this across one provider's version window. Between the March 2023 and June 2023 snapshots of GPT-4, the rate at which it produced directly executable code fell from 52% to 10%, largely because the later snapshot began wrapping its code in extra formatting, so output that a downstream step could run unmodified now had to be unwrapped first. On a prime-versus-composite identification task, accuracy moved from 84% to 51% between the same two snapshots. An application that fed that model's output straight into a parser did not get a deprecation notice. It got a quietly higher parse-failure rate, surfacing wherever that output was consumed and nowhere near the cause.

The careful reading — and it is the paper's own framing — is that this is evidence of _behavior change_, not of a model getting "dumber." A later snapshot is not globally worse; it is differently distributed, better at some things and reshaped at others. But "differently distributed" is exactly the event a deploy process exists to catch. A change that silently alters your output distribution is a deploy whether or not anyone on your side typed anything. The defense has two parts. Pin model versions explicitly, so "the model" is a named, frozen artifact and not a moving reference — never let "latest" float in production, because "latest" is a standing subscription to unscheduled deploys. And treat every provider version bump as an incoming change to be put through the same rollout as one of your own.
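One way to enforce the pin is a startup check, sketched below in Python under stated assumptions. The model identifier `example-model-2025-01-15` is hypothetical, the list of floating markers is a guess at common alias names, and `served_model_matches` assumes the provider's response reports which model actually served the request.

```python
# Assumption: aliases that float contain one of these words. Adjust per provider.
FLOATING_MARKERS = ("latest",)


def validate_model_pin(model_id: str, approved: frozenset[str]) -> str:
    """Return model_id if it is a pinned, already-canaried snapshot; raise otherwise."""
    lowered = model_id.lower()
    if any(marker in lowered for marker in FLOATING_MARKERS):
        raise ValueError(f"{model_id!r} looks like a floating alias; pin a named snapshot")
    if model_id not in approved:
        raise ValueError(f"{model_id!r} has not been through a rollout yet")
    return model_id


def served_model_matches(pinned: str, reported: str) -> bool:
    """Compare the pin with the model identifier the provider says it served."""
    return pinned == reported


# Hypothetical identifier, used for illustration only.
APPROVED_MODELS = frozenset({"example-model-2025-01-15"})
MODEL_ID = validate_model_pin("example-model-2025-01-15", APPROVED_MODELS)
```

A new snapshot joins `APPROVED_MODELS` only after it has been canaried, so a version bump becomes a reviewed change to that set and not something that happens silently.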

## Offline evals are necessary and not sufficient

Having accepted that prompt and model changes are deploys, the natural instinct is to gate them on an eval suite — and you should, exactly as [the eval-driven development essay](/blog/eval-driven-development/) argues. Run the candidate against a graded dataset, block the change on a regression, and ship only what clears the bar. That is correct and necessary. It is also not sufficient, and there is a sharp, well-dated demonstration of why.

In April 2025 OpenAI shipped a GPT-4o update, found within days that it had become markedly sycophantic — agreeing too readily, flattering the user, validating rather than evaluating — and rolled it back. The postmortems are unusually specific about the process gap, which is what makes the incident worth studying rather than just noting. The update had passed offline evaluations and A/B tests; the quantitative gates were green. Expert reviewers doing qualitative checks had raised concerns, and those concerns were not treated as launch-blocking. And the specific regression slipped through because, in OpenAI's own words, they "didn't have specific deployment evaluations tracking sycophancy."

That last clause is the whole lesson. An offline eval suite tests for the failure modes you thought to encode as tests. Sycophancy was not in the suite because no one had predicted it for that release — and a regression that is not in the suite is, to the suite, invisible. This generalizes past one incident and one provider: the failure that hurts you is disproportionately the one you did not anticipate, precisely because the ones you anticipated already have tests guarding them. A team can write more eval cases — it should — but it cannot write a test for the failure it has not imagined. None of this is an argument against eval gates; the eval gate caught real regressions and would catch them again. It is the argument for the thing an eval gate structurally cannot be: a net for the _unanticipated_ regression. That net is a staged rollout — exposing the change to real traffic, in a slice small enough to be safe and instrumented enough to be measured, so the failure you did not predict still shows up in the numbers before it reaches everyone.

## Canary the change

The discipline for that already exists, fully worked out, in site reliability practice — it has simply not been pointed at prompts. Google's SRE guidance defines a canary as "a partial and time-limited deployment of a change in a service and its evaluation." The change goes to a small slice of traffic; the rest stays on the prior version as a control group; the two are compared on real signals; the change is widened only if the comparison holds.

```
   before    100% ───────────────────────────►  prompt v3
   canary      5% ──► prompt v4 (canary) ─┐
              95% ──► prompt v3 (control) ─┴─► compare on a few real metrics
                                             │
                            diverged? ───────┴──► roll back   ok? ──► widen
```
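The split in the diagram can be sketched as a deterministic hash of a stable identifier, so a given user stays in the same arm for the whole canary window. This is a minimal Python sketch, not a production router. `assign_arm`, the salt string, and the `prompt-v3`/`prompt-v4` labels are illustrative names taken from the diagram.

```python
import hashlib


def assign_arm(user_id: str, canary_percent: float, salt: str = "prompt-v4-rollout") -> str:
    """Return "canary" or "control" for a user, stably across requests."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "control"


PROMPT_BY_ARM = {"canary": "prompt-v4", "control": "prompt-v3"}


def prompt_version_for(user_id: str, canary_percent: float = 5.0) -> str:
    """Pick the prompt version to serve this user during the rollout."""
    return PROMPT_BY_ARM[assign_arm(user_id, canary_percent)]
```

Because the bucket depends only on the salt and the user, raising `canary_percent` keeps every existing canary user in the canary and only adds new ones, which keeps the comparison clean while the rollout widens.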

Two numbers from the SRE playbook carry straight over to prompts. The first is blast radius: the damage from a bad change is the canary's traffic share multiplied by its defect rate. A 5% canary failing 20% of the time degrades 1% of overall traffic — and keeping that product small is the entire reason the canary slice is small. The point is not to avoid the bad change; you cannot, since the canary's job is precisely to find it by letting it happen. The point is to make the bad change cheap when it happens, contained to a sliver while the other 95% of traffic stays on a version known to be fine.
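The arithmetic is small enough to write down. The sketch below restates the product from the paragraph above and rearranges it to size a canary from a damage budget. The function names are illustrative, and the worst-case defect rate is something you have to assume up front, since the real one is what the canary is there to discover.

```python
def blast_radius(canary_share: float, defect_rate: float) -> float:
    """Fraction of overall traffic degraded: canary share times its defect rate."""
    if not (0.0 <= canary_share <= 1.0 and 0.0 <= defect_rate <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    return canary_share * defect_rate


def max_canary_share(damage_budget: float, worst_case_defect_rate: float) -> float:
    """Largest canary slice that keeps blast radius within the budget (same formula, rearranged)."""
    if worst_case_defect_rate <= 0.0:
        return 1.0
    return min(1.0, damage_budget / worst_case_defect_rate)


# The example from the text: a 5% canary failing 20% of the time.
print(f"{blast_radius(0.05, 0.20):.1%}")  # prints 1.0%
# The same defect shipped to everyone degrades 20% of traffic.
print(f"{blast_radius(1.00, 0.20):.1%}")  # prints 20.0%
```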

The second is the comparison itself. The SRE guidance is firm that a canary should be judged on a small handful of metrics — under a dozen — that are user-perceivable and attributable to the change, not on a wall of dashboards. For a prompt or model swap that means the eval scores plus the two or three product metrics the change could plausibly move: a refund-drafting agent's canary watches draft-acceptance rate and escalation rate, not CPU. Too many metrics and ordinary noise in some unrelated one trips a false rollback, training the team to ignore the canary — the same way an over-gated approval queue trains a reviewer to stop reading. A canary is only useful if its verdict is trusted, and a verdict is only trusted if it fires on signal. When the canary diverges from the control beyond a set threshold, the rollout stops and the change goes back.
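A verdict over a short, named list of metrics might look like the following Python sketch. It is deliberately naive: it compares point estimates against fixed tolerances, where a real canary analysis would also require a minimum sample size and some test that the gap is not noise. `MetricRule`, `canary_verdict`, and the tolerance values are assumptions for illustration; the two metric names come from the refund-drafting example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    name: str
    higher_is_better: bool
    max_degradation: float  # tolerated absolute gap, in the metric's own units


def canary_verdict(
    control: dict[str, float],
    canary: dict[str, float],
    rules: list[MetricRule],
) -> tuple[str, list[str]]:
    """Return ("rollback" | "widen", names of the metrics that breached their tolerance)."""
    breaches = []
    for rule in rules:
        delta = canary[rule.name] - control[rule.name]
        # Express every gap as "how much worse is the canary", whatever the direction.
        degradation = -delta if rule.higher_is_better else delta
        if degradation > rule.max_degradation:
            breaches.append(rule.name)
    return ("rollback" if breaches else "widen", breaches)


RULES = [
    MetricRule("draft_acceptance_rate", higher_is_better=True, max_degradation=0.03),
    MetricRule("escalation_rate", higher_is_better=False, max_degradation=0.02),
]
```

With a control at 82% acceptance and 6% escalation, a canary at 80% acceptance and 11% escalation stays within tolerance on the first metric and breaches the second, so the verdict is a rollback that names `escalation_rate` as the reason.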

Run the GPT-4o sycophancy regression through that machinery to see what it buys. The sycophantic update degrades a behavior no offline test was watching for — but a canary does not need a test named "sycophancy" to catch it, because sycophancy has user-perceivable consequences that the product metrics already track. An over-agreeable agent confirms things it should question, so on the canary slice a downstream correctness or escalation metric drifts against the control even though every offline eval stayed green. The canary's verdict is not "this change is sycophantic" — it is "this change diverges from the control on a metric users feel," which is the verdict that matters and the one an offline suite full of anticipated tests could not produce. The unanticipated regression became visible not because someone predicted it but because 5% of real traffic met it and the comparison noticed.

## The rollback has to be one action

A canary is only as good as the rollback it can trigger, because the canary's entire promise — contained, cheap failure — depends on exiting the bad state fast. If reverting a bad prompt means an engineer reconstructing the previous wording from memory, or hunting it out of chat history, or redeploying the whole service to undo a text edit, then the rollback is slow at the exact moment its speed is the point, and the canary caught the problem only to hand it to a recovery step that bleeds.

The fix is to make prompts and pinned model versions versioned artifacts. Every prompt revision is stored and individually addressable; every model version is pinned explicitly; and "roll back" is defined as repointing production to the previous version — one action, no reconstruction, no redeploy, no judgment call under pressure about what the old prompt said. The previous version is a named thing that still exists, and rollback selects it. The distinction is between rollback as a procedure and rollback as a property: a procedure has steps that can be performed wrong, slowly, or out of order at 2 a.m. by whoever is on call, while a property is just true — the prior version is addressable, so reverting to it is a single dereference. A canary that finds a regression in minutes, handed to a rollback that takes an hour, has spent its speed advantage entirely; the two have to be fast together or the fast half is wasted.
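Rollback as a property can be sketched with an in-memory registry: every release is stored under a content-derived version, production is a pointer, and reverting moves the pointer. This is a minimal Python illustration, not a recommendation of a particular store. `Release`, `ReleaseRegistry`, and the model identifier are hypothetical, and a real system would persist both the releases and the pointer history.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    prompt_text: str
    model_id: str  # a pinned identifier, never a floating alias

    @property
    def version(self) -> str:
        """Content-derived address: same prompt and model always give the same version."""
        payload = f"{self.model_id}\n{self.prompt_text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


class ReleaseRegistry:
    def __init__(self) -> None:
        self._releases: dict[str, Release] = {}
        self._history: list[str] = []  # versions production has pointed at, oldest first

    def publish(self, release: Release) -> str:
        """Store a release so it stays addressable; does not change production."""
        self._releases[release.version] = release
        return release.version

    def promote(self, version: str) -> None:
        """Point production at an already-published version."""
        if version not in self._releases:
            raise KeyError(f"unknown version {version!r}")
        self._history.append(version)

    def rollback(self) -> str:
        """Repoint production to the previous version. One action, no reconstruction."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def current(self) -> Release:
        if not self._history:
            raise RuntimeError("nothing has been promoted yet")
        return self._releases[self._history[-1]]
```

The rolled-back release is not deleted; it remains published under its version, so it can be inspected afterwards or promoted again once fixed.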

That same versioning is also what makes the eval gate enforceable rather than aspirational. Tooling such as LangSmith's evaluation and regression-testing workflow runs a candidate prompt or model against a golden dataset, compares it side by side with the current production version, and can fail a CI pipeline on a score regression. With versioned artifacts the whole promotion path becomes a pipeline with no improvised steps: a change is a new version; the new version is eval-gated in CI against the golden dataset; if it clears, it is canaried against the old version on a live traffic slice; it is widened only if the canary holds; and if anything regresses at any stage, rollback is repointing to the version that was demonstrably fine an hour ago. Every step is mechanical, every step is reversible, and none of them depends on anyone remembering anything.
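The eval-gate step of that path reduces to a comparison a CI job can fail on. The sketch below is plain Python and does not use LangSmith's API. The exact-match grader, the `Example` alias, and the function names are simplifying assumptions, and a real suite would use graders suited to the task.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input, expected output)
Agent = Callable[[str], str]


def score(agent: Agent, golden: Sequence[Example]) -> float:
    """Fraction of golden examples the agent answers exactly (after trimming whitespace)."""
    if not golden:
        raise ValueError("golden dataset is empty")
    passed = sum(1 for prompt, expected in golden if agent(prompt).strip() == expected)
    return passed / len(golden)


def eval_gate(
    candidate: Agent,
    baseline: Agent,
    golden: Sequence[Example],
    max_regression: float = 0.0,
) -> bool:
    """Pass only if the candidate scores no worse than the baseline, within tolerance."""
    return score(candidate, golden) >= score(baseline, golden) - max_regression


def ci_exit_code(candidate: Agent, baseline: Agent, golden: Sequence[Example]) -> int:
    """0 lets the pipeline continue to the canary stage; 1 blocks the change."""
    return 0 if eval_gate(candidate, baseline, golden) else 1
```

In CI, the wrapper script would call `sys.exit(ci_exit_code(...))` with the candidate and current production versions, so a score regression stops the change before it reaches any traffic.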

## The checklist

- [ ] Prompts are versioned artifacts; every revision is stored and addressable.
- [ ] Model versions are pinned explicitly — no "latest" floating in production.
- [ ] A provider model update is treated as an incoming deploy and canaried like any other change.
- [ ] Changes are eval-gated in CI against a golden dataset before any traffic.
- [ ] Rollout is staged: a small canary slice against a control, compared on a few user-perceivable metrics.
- [ ] Rollback is one action — repoint to the previous version — and is rehearsed, not improvised.

## Reading list

- Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design — up to 76 accuracy points from formatting alone: [arXiv 2310.11324](https://arxiv.org/abs/2310.11324)
- How is ChatGPT's behavior changing over time? — a provider-side model swap that silently changed output behavior: [arXiv 2307.09009](https://arxiv.org/abs/2307.09009)
- OpenAI — sycophancy in GPT-4o, the incident: a model deploy that passed evals, regressed, and was rolled back: [openai.com](https://openai.com/index/sycophancy-in-gpt-4o/)
- OpenAI — expanding on what we missed with sycophancy; the postmortem on the missing deployment eval: [openai.com](https://openai.com/index/expanding-on-sycophancy/)
- Google SRE Workbook — canarying releases; the partial-and-time-limited definition and the blast-radius arithmetic: [sre.google](https://sre.google/workbook/canarying-releases/)
- LangChain — regression testing for LLM applications; eval-gated, version-compared promotion: [langchain.com](https://www.langchain.com/blog/regression-testing)

Treat the prompt edit as the deploy it is, or keep finding out the way the last team did: late, from a metric, with no idea which line did it and no fast way back.