A team ships an AI feature. The eval-pass rate is at the budget, the p95 latency is at the budget, the cost per request is at the budget. The consultancy hands off. The team adds the dashboards to a shared channel that nobody looks at every day, because nothing is broken. A quarter passes. The customer-support volume on the feature creeps up — not enough to file an incident, just enough that a senior engineer notices the queue is longer than it was. They open the dashboard. The eval-pass rate is at 0.88. The budget set in week two of the engagement was 0.92. Nobody got paged because the threshold for paging was 0.80.
The system did not break. The system slid. A paging threshold of 0.80 made sense when the question was whether a system in development was fit to ship at all; it is the wrong threshold for a system expected to hold above 0.92 in production. The gap between “running” and “running well” lives in the difference between those two numbers, and that gap is where the work of operating an AI system actually happens.
This post is a field map of that gap: the three signals that drift in the first quarter after handoff, what their honest baselines look like, and the operating discipline that catches the drift before the customer-support queue does.
The decay is not a model problem
The reflex when a deployed AI system underperforms is to ask what changed in the model. The model is rarely what changed. Sculley and colleagues laid out the structural reason a decade ago in Hidden Technical Debt in Machine Learning Systems (NeurIPS 2015): the ML code is only a small black box in the middle of a system whose total complexity lies in the data pipelines, feature stores, glue code, configuration, and downstream consumers, and the maintenance cost of that surrounding system compounds in a way that ordinary software debt does not. The paper named CACE (Changing Anything Changes Everything) as the property that makes ML systems unusually slippery: a change has no good notion of locality, because no input is ever really independent of the others.
The practical consequence at month three is that the model on disk is the same model that shipped, the prompts are the same prompts, the retriever is the same retriever — and the system is performing measurably worse. The system is worse because the world the system is consuming is different from the world the system was evaluated against, and an unchanged system against a shifted world looks indistinguishable from a degraded system. The repair is not a model swap. The repair is finding the input distribution that drifted, the cost driver that bloated, or the dashboard threshold that was too generous to catch what was happening.
Signal one: the eval-pass rate slides under its own threshold
The eval suite a team owns at handoff is calibrated to the engagement’s exit conditions. The pass-rate budget (call it 0.92) is what the system held at the production deploy. The CI gate is usually wired to fail a deploy below a lower bar, typically 0.80 or 0.85, because a CI gate is a deploy-blocker and a deploy-blocker that fires too eagerly gets disabled. Both numbers are correct for what they do. Neither of them catches the system as it slides from 0.92 to 0.88 to 0.86 over a quarter: none of those values trips the CI gate, yet every value after the first is below the budget set against a real customer expectation.
The fix is not to raise the CI gate. The CI gate is a deploy-blocker, not a steady-state monitor, and raising it punishes the wrong thing. The fix is a separate steady-state alert wired to the budget — a slow-burn alarm that fires when the seven-day moving average of the eval-pass rate drops more than a fixed delta below the handoff budget, regardless of whether the system is technically passing CI. The two thresholds answer different questions: the CI gate asks is this deploy fit to ship? and the steady-state alarm asks is this system still doing what we said it would? A team that watches only the CI gate will reliably miss the second question.
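A minimal sketch of that slow-burn alarm, next to the CI gate it is meant to complement. The 0.92 budget and 0.80 gate come from the example in this post; the 0.02 delta, the function names, and the sample week of daily pass rates are assumptions made up for illustration.

```python
from statistics import mean

HANDOFF_BUDGET = 0.92  # pass-rate budget from this post's example
CI_GATE = 0.80         # deploy-blocking threshold from this post's example
ALERT_DELTA = 0.02     # assumed slack below the budget before the alarm fires
WINDOW_DAYS = 7


def ci_gate_passes(pass_rate, gate=CI_GATE):
    """Deploy-blocker: is this deploy fit to ship?"""
    return pass_rate >= gate


def slow_burn_alarm(daily_pass_rates, budget=HANDOFF_BUDGET,
                    delta=ALERT_DELTA, window=WINDOW_DAYS):
    """Steady-state monitor: True when the trailing moving average
    sits more than `delta` below the handoff budget."""
    if len(daily_pass_rates) < window:
        raise ValueError(f"need at least {window} daily values")
    moving_avg = mean(daily_pass_rates[-window:])
    return moving_avg < budget - delta


# Hypothetical week of daily eval-pass rates, three months after handoff.
last_week = [0.89, 0.88, 0.88, 0.87, 0.88, 0.89, 0.87]

print(ci_gate_passes(last_week[-1]))  # True: the deploy would still ship
print(slow_burn_alarm(last_week))     # True: the system has slid off its budget
```

Both checks return True on the same data, which is the whole point: the deploy is shippable and the system is no longer doing what the handoff said it would.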
The dashboard view that catches the slide is the per-category breakdown over time. Aggregate pass rate is an average across categories, and a single category falling apart — say, a corpus area that gained new documents the retriever has not been tuned for — can hide inside an aggregate that still looks fine. The post on LLM-as-judge evaluation is the framework for trusting per-category signals; the dashboard discipline is to plot every category every week and look at them one at a time, not as one number.
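To make the averaging problem concrete, here is a small sketch with invented numbers: three hypothetical categories of 100 eval cases each, where the aggregate clears the 0.92 budget while one category sits at 0.80. The category names and the helper functions are illustrative assumptions, not part of any particular eval framework.

```python
from collections import defaultdict

BUDGET = 0.92  # handoff budget from this post's example


def pass_rate_by_category(results):
    """results: iterable of (category, passed) pairs from one eval run."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: p / n for cat, (p, n) in totals.items()}


def categories_below(results, budget=BUDGET):
    """Names of categories whose own pass rate is under the budget."""
    rates = pass_rate_by_category(results)
    return sorted(cat for cat, rate in rates.items() if rate < budget)


# Hypothetical week: the newest corpus area is the one falling apart.
results = (
    [("billing", True)] * 99 + [("billing", False)] * 1
    + [("onboarding", True)] * 98 + [("onboarding", False)] * 2
    + [("new_product_docs", True)] * 80 + [("new_product_docs", False)] * 20
)

aggregate = sum(passed for _, passed in results) / len(results)
print(round(aggregate, 3))        # 0.923, above the budget
print(categories_below(results))  # ['new_product_docs']
```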
Signal two: retrieval recall drifts with the corpus
RAG systems pass eval at handoff against a snapshot of the corpus. The corpus is rarely a snapshot in production. New documents land in the source-of-truth system — a CMS, a Confluence space, a Salesforce knowledge base — and the indexed corpus is one re-embed and one chunk-pass behind. The lag is usually invisible because the eval set is mostly old queries against mostly old documents, and the new documents are not what the eval is grading.
The drift surface is at the intersection of two facts: the corpus is changing daily, and the eval set was written at handoff. After a quarter, the gap between what the eval grades and what the system answers is meaningful. The system can be answering today’s questions worse than the eval set thinks because the eval set is not asking today’s questions. Recall is the signal: a recall metric run weekly against the live corpus on a freshness-balanced query set will surface the drift earlier than a static eval will. If recall on the last 30 days of new documents is below recall on documents older than 90 days, the retriever is the place to look, not the model.
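A sketch of that comparison, under the assumption that each query in the freshness-balanced set is labelled with the date its target document entered the corpus. The record layout, the function names, and the sample queries are hypothetical; the 30-day and 90-day cutoffs are the ones named above.

```python
from datetime import date, timedelta


def recall_at_k(retrieved_ids, relevant_ids):
    """Fraction of labelled relevant documents found in the retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no labelled relevant documents")
    return len(set(retrieved_ids) & relevant) / len(relevant)


def recall_by_age(queries, today, fresh_days=30, old_days=90):
    """Average recall for queries targeting fresh vs. old documents.

    queries: dicts with 'doc_added' (date), 'retrieved', 'relevant'.
    Queries whose target falls between the two cutoffs are ignored.
    """
    buckets = {"fresh": [], "old": []}
    for q in queries:
        age = (today - q["doc_added"]).days
        if age <= fresh_days:
            buckets["fresh"].append(recall_at_k(q["retrieved"], q["relevant"]))
        elif age > old_days:
            buckets["old"].append(recall_at_k(q["retrieved"], q["relevant"]))
    return {name: (sum(v) / len(v) if v else None) for name, v in buckets.items()}


today = date(2025, 6, 30)  # arbitrary date for the example
queries = [
    {"doc_added": today - timedelta(days=5), "retrieved": ["d1", "d9"], "relevant": ["d1", "d2"]},
    {"doc_added": today - timedelta(days=20), "retrieved": ["d7"], "relevant": ["d3"]},
    {"doc_added": today - timedelta(days=200), "retrieved": ["d4", "d5"], "relevant": ["d4"]},
    {"doc_added": today - timedelta(days=400), "retrieved": ["d6", "d8"], "relevant": ["d6", "d8"]},
]

result = recall_by_age(queries, today)
print(result)  # {'fresh': 0.25, 'old': 1.0}
retriever_suspect = result["fresh"] < result["old"]
```

When the fresh bucket trails the old one, as it does in this toy data, the index lag and the chunking pass are the first things to check.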
Recall alone is subtler than the aggregate number makes it look. A retrieval improvement does not automatically translate into an answer-quality gain, and a recall metric can move with the corpus while the answer quality the customer sees does not. The operating consequence: a team that watches recall but not category-level eval-pass is reading half the dashboard. The two signals interact, and they need to be read together.
Signal three: cost per request creeps
The third drift signal is the one most teams notice last because nobody pages on a number that is still inside the budget. Cost per request at handoff is the price of one inference at the configuration the system shipped with. Three months later, the cost per request is usually higher — not because the underlying API got more expensive but because the system is doing more per request. Context windows grew because someone added a new retrieved field to the prompt. Tool calls multiplied because the agent harness learned to retry. Rerankers moved from a sample of candidates to all candidates. None of those was an outage; all of them moved the bill.
The dashboard view that catches it is the cost decomposition: cost per request, broken down by the line item that drives it. The right framing is the one used for compounding step costs in an agent harness, where each added step compounds both error and bill; the operating practice is to plot the same decomposition every week and watch which lane is widening. Cost creep is usually configuration drift, and configuration drift is among the dirtiest kinds of debt the Sculley paper names: a configuration change that alters behavior is invisible to the kind of testing that catches code changes, so the system as a whole performs differently with no code diff to point at. The post on agent budgets is the budget-side companion to this lens: where the cost ceilings live in production and how the dashboard reads them.
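One way to sketch the decomposition, assuming each request is logged with a cost per line item. The line-item names, the micro-dollar figures, and the 10% tolerance are invented for the example; a real system would pull these from its own billing or tracing data.

```python
def decompose(requests):
    """Average cost per request for each line item.

    requests: list of dicts mapping line item -> cost for one request
    (integer micro-dollars here, to keep the arithmetic exact).
    """
    if not requests:
        raise ValueError("no requests to decompose")
    items = sorted({item for r in requests for item in r})
    n = len(requests)
    return {item: sum(r.get(item, 0) for r in requests) / n for item in items}


def widening_lanes(baseline, current, tolerance=0.10):
    """Line items whose average cost grew by more than `tolerance` (relative)."""
    grown = {}
    for item, now in current.items():
        before = baseline.get(item, 0.0)
        if before == 0.0:
            if now > 0.0:
                grown[item] = float("inf")  # line item did not exist at baseline
        elif (now - before) / before > tolerance:
            grown[item] = (now - before) / before
    return grown


# Hypothetical per-request costs at handoff and three months later.
handoff_week = [
    {"prompt_tokens": 4000, "completion_tokens": 2000, "tool_calls": 1000, "rerank": 1000},
    {"prompt_tokens": 4000, "completion_tokens": 2000, "tool_calls": 1000, "rerank": 1000},
]
this_week = [
    {"prompt_tokens": 4600, "completion_tokens": 2000, "tool_calls": 1000, "rerank": 1000},
    {"prompt_tokens": 4600, "completion_tokens": 2000, "tool_calls": 1600, "rerank": 1000},
]

baseline = decompose(handoff_week)
current = decompose(this_week)
print(sorted(widening_lanes(baseline, current)))  # ['prompt_tokens', 'tool_calls']
```

In this toy data the total rises from 8000 to 8900 micro-dollars per request, a figure that can sit comfortably inside a budget, while the per-lane view points straight at the grown prompt and the retried tool calls.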
The remediation is rarely a model swap, either. It is finding the configuration that drifted — usually in a YAML file or a feature flag or a prompt template — and either rolling it back or recosting the budget against the new behavior, intentionally and in writing. Drift that becomes the new baseline silently is the worst case; drift that becomes the new baseline because someone wrote down why is fine.
The discipline is staring at the dashboard
There is no algorithmic fix for the failure mode this post is about. The fix is the operating practice of looking at the dashboard before the customer-support queue tells you to. The pattern Google’s SRE writing has documented for a decade (Site Reliability Engineering, Beyer et al., 2016) — monitoring is the foundation, error budgets are the contract, postmortems are the learning — translates onto AI systems with one twist: the failure modes are slower and quieter than network failures, and the dashboards have to be read on a longer cadence than human attention naturally pays.
What that looks like in practice for a team running a system handed off three months ago: a fifteen-minute review every Monday morning, with one engineer named and on the calendar; a one-page memo summarizing what the dashboards say, written every Monday and circulated; a per-category eval-pass plot, a recall-by-document-age plot, and a cost-decomposition plot on the same page; and a documented threshold for each that escalates to a paging alert, not to a Slack ping in a channel nobody watches. The cadence is the point. A weekly review catches a slide that would otherwise go unnoticed for a month in a channel that is only watched when something is on fire.
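The documented thresholds can live in code next to the memo, so that the escalation decision is not left to whoever happens to read the channel. The sketch below is one possible shape; the signal names and every limit in it are assumptions for illustration, and each team would write down its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str
    limit: float
    higher_is_better: bool

    def breached(self, value):
        """True when the reading is on the wrong side of the limit."""
        return value < self.limit if self.higher_is_better else value > self.limit


# Assumed limits, for illustration only.
THRESHOLDS = [
    Threshold("eval_pass_rate_7d", 0.90, higher_is_better=True),
    Threshold("recall_fresh_docs", 0.85, higher_is_better=True),
    Threshold("cost_per_request_growth", 0.10, higher_is_better=False),
]


def weekly_memo(readings):
    """Return (memo_lines, page); page is True if any threshold is breached."""
    lines, page = [], False
    for t in THRESHOLDS:
        value = readings[t.name]
        status = "PAGE" if t.breached(value) else "ok"
        page = page or status == "PAGE"
        lines.append(f"{t.name}: {value:.2f} (limit {t.limit:.2f}) {status}")
    return lines, page


# Hypothetical Monday readings.
readings = {
    "eval_pass_rate_7d": 0.88,
    "recall_fresh_docs": 0.91,
    "cost_per_request_growth": 0.15,
}
memo, page = weekly_memo(readings)
print("\n".join(memo))
print("escalate to pager:", page)
```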
The 30-day support window after a Proof of Tech engagement is exactly long enough to embed this discipline. It is too short to substitute for it, which is the reason the practice is the handoff, not the support window.
What the numbers do not say
A few honest qualifications, because this post is grounded more in field practice than in peer-reviewed evidence.
First, the specific decay rate — “0.92 at handoff, 0.86 by month three” — is illustrative, not measured against a published corpus. The pattern is what is real; the precise numbers are stand-ins for the shape of the curve and will differ across systems, domains, and corpora. A team that takes the literal number as a prediction is reading too much into a story whose point is the shape, not the slope.
Second, the three signals named here — eval-pass rate, retrieval recall, cost per request — are the most common ones a deployed RAG or agent system drifts on. They are not the only ones. Latency, refusal rate, tool-call patterns, and safety-eval pass rate all drift too. The three were picked because they are measurable on dashboards a team already has at handoff. A team running a system with a different profile should add the signals that match its risk surface, not adopt these three uncritically.
Third, “the model is rarely what changed” is a heuristic, not a law. Sometimes the model is what changed — a provider deprecates a version, a fine-tune is replaced, a weight checkpoint gets swapped. The fix is not to skip the model when investigating; it is to investigate every layer and not start at the model on instinct. The first place to look is the input distribution and the configuration, because that is where the drift usually lives. The model is the second or third place, and the assumption that it is the first place is the one this post is arguing against.
The checklist
The operating practice of catching the quiet drift, before the support queue does:
Reading list
- Hidden Technical Debt in Machine Learning Systems (Sculley et al., NeurIPS 2015) — the canonical paper on why ML systems accumulate debt faster than software, and why the model is “a tiny black box” inside a larger system whose changes are invisible to ordinary testing.
- Site Reliability Engineering (Beyer et al., O’Reilly 2016) — the discipline of monitoring, error budgets, and postmortems, free online and load-bearing for the operating practice this post recommends.
- LLM-as-judge evaluation — the framework for the per-category eval-pass signal this post asks you to plot weekly, including the calibration steps that make a judge trustworthy enough to read.
- Agent budgets — the budget-and-ceiling lens for the cost-creep signal: where the per-tenant, per-tool, and per-step caps live, and how they fail closed when configuration drift moves the bill.
- Agent observability — the dashboard primitives that catch drift before the customer-support queue does: span traces, per-tool spend attribution, and the steady-state monitors that sit beside the CI gate.
The system that passed every eval at handoff is the system that is most likely to slide quietly, because the team is most likely to trust that nothing is wrong. A quarter is enough time for the eval-pass rate to drop four points, the retrieval-on-new-documents recall to lag the rest by ten, and the cost per request to climb fifteen percent — none of it dramatic, none of it visible without a dashboard somebody is looking at. The discipline that catches the slide is the discipline of staring at the dashboard on a Monday morning when nothing is broken. That is what month three actually looks like, and it is the part of running an AI system in production that no one ships at handoff.