Your golden set is rotting.
A golden evaluation set is not a fixed asset — it decays. The world changes, the product shifts, the team overfits, and the pass rate quietly stops meaning anything. Eval data needs a maintenance protocol.

The eval suite has read 95% for two quarters. It runs in CI, it gates every merge, and nobody has touched it because nothing has gone wrong with it. Meanwhile the support queue is filling with the same complaint in different words: the product used to handle this, and now it does not. The dashboard says the system is healthy. The users say it regressed. Both are reporting honestly.
The instinct is to trust the dashboard: it has hard numbers, and the users have anecdotes. That instinct is the bug. The 95% is also an anecdote; it is the answer to a question the golden set asked a year ago and has been asking, unchanged, ever since. The world the set was built to probe has moved. The pass rate held steady not because the system stayed healthy but because the set lost the ability to detect failure, the way a thermometer goes on reading room temperature accurately while the building burns down around it.
A golden evaluation set gets treated as a fixed asset — built once, then trusted forever, like a unit test that either passes or does not. It is not that kind of asset. It is a sample of a moving distribution, and a sample goes stale. The facts in it date, the product drifts away from it, the team grinds it down, its labels accumulate errors, and its distribution slides away from the traffic it was meant to mirror. A golden set without a maintenance protocol does not hold its value. It rots, on its own schedule, while showing you a number that says otherwise. This post is the protocol.
“The set decayed” is not one failure. It is five, with five different causes and five different fixes. Naming them separately is the whole point, because a team that sees only “the number is stale” reaches for one fix and misses four.
Five failures, one symptom. The pass rate is high, holds steady, and tells you nothing — because a number can only mean something if the thing producing it has been kept honest.
One of those five deserves its own paragraph, because teams assume their golden labels are clean and the published evidence says otherwise.
The reference result is Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks (Northcutt et al., 2021). The authors went through ten of the most-cited ML benchmarks — datasets thousands of papers had treated as ground truth — and found an average of about 3.4% label errors. Not in obscure sets; in the canonical ones. And the consequence was not cosmetic: on corrected labels, benchmark rankings changed — models ranked behind others moved ahead once the mislabeled examples were fixed. The leaderboard had partly been measuring noise.
Carry that to your golden set. It is smaller, more domain-specific, and labeled with less ceremony than ImageNet. There is no reason to assume it beats 3.4%, and several reasons — ambiguous domain calls, specs that shifted after labeling, less reviewer redundancy — to assume worse. A few percent wrong labels means a few percent of your pass rate is pure noise, sitting in fixed locations, quietly failing your good builds and passing your bad ones. Label review is not housekeeping. It is the part of maintenance that determines whether the other parts measure anything.
The rot is gradual and the number is reassuring, so you need symptoms that name the disease before a customer does.
If you recognize two of these, the set is rotting. The number is not the question; whether the number still discriminates is.
A golden set holds its value the way any sampled dataset does: someone owns it, and it is refreshed on a cycle. The protocol below is the standing process — not a one-time cleanup, a cadence.
The golden set is an artifact with a history, so give it one. Each version is tagged, dated, and accompanied by a changelog: cases added, retired, relabeled, and why. The payoff is comparison that means something — when a pass rate moves, you can tell whether the build changed or the set changed. Eval results reported against an unversioned, mutable set are uninterpretable: two runs were never graded against the same instrument.
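A minimal sketch of what "tagged, dated, and accompanied by a changelog" could look like in Python. Every name here (`GoldenSetVersion`, `ChangeEntry`, `EvalResult`, `comparable`) and the tag format are hypothetical, chosen for illustration rather than taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChangeEntry:
    case_id: str
    action: str   # "added", "retired", or "relabeled"
    reason: str   # the "why" that makes the changelog useful later


@dataclass(frozen=True)
class GoldenSetVersion:
    tag: str                # hypothetical tag format
    released: date
    case_ids: frozenset
    changelog: tuple = ()


@dataclass(frozen=True)
class EvalResult:
    build: str
    set_tag: str            # which set version graded this run
    pass_rate: float


def comparable(a: EvalResult, b: EvalResult) -> bool:
    """Two pass rates are comparable only if graded against the same set version."""
    return a.set_tag == b.set_tag


v1 = GoldenSetVersion("golden-v1", date(2024, 1, 15), frozenset({"c1", "c2", "c3"}))
v2 = GoldenSetVersion(
    "golden-v2",
    date(2024, 4, 15),
    frozenset({"c1", "c3", "c4"}),
    changelog=(
        ChangeEntry("c2", "retired", "golden answer described a removed feature"),
        ChangeEntry("c4", "added", "sampled from production traces"),
    ),
)

old = EvalResult(build="build-101", set_tag=v1.tag, pass_rate=0.95)
same_set = EvalResult(build="build-102", set_tag=v1.tag, pass_rate=0.94)
new_set = EvalResult(build="build-140", set_tag=v2.tag, pass_rate=0.91)
```

Here `comparable(old, same_set)` holds, so a one-point drop can be attributed to the build. `comparable(old, new_set)` does not, so the four-point drop could be the build, the set, or both, and the changelog is where you would start looking.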
This is the engine of the whole protocol. The chunking essay makes the prior point — build a graded question set before tuning anything; maintenance is that argument extended through time: keep building it. On a regular cycle, pull real queries from production traces, especially ones the current system handled poorly, label them, and fold them in. Fresh cases pull the set’s distribution back toward live traffic, and they are the cases the team has not had a chance to overfit to yet. A set that only ever grows from the imagination of its authors drifts away from reality at exactly the rate reality changes. (Mining traces for eval cases is one more reason the trace pipeline the agent-observability essay argues for is not optional — you cannot sample from production you did not record.)
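One way the refresh step might be sketched, assuming each production trace is a dict with a `query` string and a `handled_ok` flag. Both keys, and the function name `pick_refresh_candidates`, are assumptions for illustration; real traces will carry whatever quality signal your pipeline records. The sketch skips queries already in the set, then puts poorly handled ones first in the labeling queue.

```python
import random


def pick_refresh_candidates(traces, existing_queries, k, seed=0):
    """Pick up to k production queries to label, favouring poorly handled ones.

    `traces` is a list of dicts with hypothetical keys "query" and "handled_ok".
    """
    # Normalise so trivial case/whitespace variants count as duplicates.
    seen = {q.strip().lower() for q in existing_queries}
    fresh = []
    for t in traces:
        key = t["query"].strip().lower()
        if key not in seen:
            seen.add(key)
            fresh.append(t)

    # Shuffle first so the choice within each group is random, then use a
    # stable sort to move failures (handled_ok == False) to the front.
    rng = random.Random(seed)
    rng.shuffle(fresh)
    fresh.sort(key=lambda t: t["handled_ok"])
    return [t["query"] for t in fresh[:k]]


traces = [
    {"query": "How do I export to CSV?", "handled_ok": True},
    {"query": "cancel my annual plan", "handled_ok": False},
    {"query": "How do I export to CSV?", "handled_ok": True},   # duplicate
    {"query": "reset 2FA without phone", "handled_ok": False},
    {"query": "Change billing email", "handled_ok": True},      # already in the set
]
existing = ["change billing email"]
candidates = pick_refresh_candidates(traces, existing, k=3)
```

The output is a labeling queue, not a set of golden cases: each candidate still needs a human-written reference answer before it is folded in.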
Maintenance removes as well as adds. Two kinds of case should leave the set. Stale ones — the golden answer describes a world that no longer exists; re-label or delete. Solved ones — cases the system has passed every run for a long stretch. A permanently-green case contributes no information; it cannot move, so it cannot tell you anything. Retiring it (or moving it to a cheap, separate smoke-test tier) keeps the golden set concentrated on behavior that is still live and still uncertain. A set that only grows becomes slow and dilutes its own signal under a rising tide of foregone-conclusion passes.
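Finding the solved cases can be mechanical. The sketch below assumes you keep a per-case history of pass/fail results, oldest run first; the function name `split_solved` and the 20-run threshold are illustrative assumptions, and the right "long stretch" depends on how often your suite runs.

```python
def split_solved(history, min_runs=20):
    """Split case ids into (live, solved).

    `history` maps case id -> list of pass/fail booleans, oldest run first.
    A case counts as solved only if it has at least `min_runs` results and
    passed every one of its most recent `min_runs` runs.
    """
    live, solved = [], []
    for case_id, results in history.items():
        recent = results[-min_runs:]
        if len(recent) >= min_runs and all(recent):
            solved.append(case_id)
        else:
            live.append(case_id)
    return live, solved


history = {
    "always-green": [True] * 30,
    "failed-recently": [True] * 25 + [False] + [True] * 4,
    "too-new": [True] * 5,   # not enough runs to call it solved
}
live, solved = split_solved(history, min_runs=20)
```

Cases in `solved` are candidates for retirement or for the smoke-test tier; a case with a short history stays live even if it has never failed.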
Given the 3.4% result, periodic re-review is not optional. On a schedule, sample the existing golden answers and have a human — ideally not the original author — re-judge them against the current spec and the current world. Pay special attention to cases that disagree with a build you have independent reason to trust: when a model you believe is correct fails a golden case, the golden label is a prime suspect, not an afterthought. Every correction is a changelog entry.
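A rough sketch of how a re-review queue might be ordered, assuming you can list the case ids that a trusted build currently fails. The function name `review_queue` and its parameters are hypothetical. Suspect cases go first; a random sample of the remaining cases follows, so that labels nobody has reason to doubt still get looked at.

```python
import random


def review_queue(case_ids, failed_by_trusted_build, sample_size, seed=0):
    """Order cases for human re-review.

    Cases failed by a build you have independent reason to trust come first;
    a random sample of the remaining cases follows.
    """
    suspects = [c for c in case_ids if c in failed_by_trusted_build]
    rest = [c for c in case_ids if c not in failed_by_trusted_build]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(sample_size, len(rest)))
    return suspects + sampled


all_cases = [f"case-{i}" for i in range(10)]
queue = review_queue(all_cases, {"case-3", "case-7"}, sample_size=3)
```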
Every case carries metadata: where it came from (production trace, hand-authored, incident postmortem), when it was added, when its label was last reviewed, and how often it has flipped. This is what makes the rest of the protocol mechanical instead of heroic. “Show me every case not reviewed in nine months” becomes a query, not an archaeology project. A case with no provenance is a case you cannot maintain, because you cannot tell whether it is stale — exactly the way an observable agent is only as auditable as the metadata trail it leaves.
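To make "becomes a query" concrete, here is a small sketch. The `GoldenCase` fields mirror the metadata listed above; the class and function names are hypothetical, and nine months is approximated as 270 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class GoldenCase:
    case_id: str
    source: str           # "production_trace", "hand_authored", "incident_postmortem"
    added: date
    last_reviewed: date
    flips: int = 0        # times the case changed between pass and fail


def overdue_for_review(cases, today, max_age_days=270):
    """Return cases whose label was last reviewed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return [c for c in cases if c.last_reviewed < cutoff]


cases = [
    GoldenCase("c1", "production_trace", date(2023, 1, 10), date(2023, 2, 1)),
    GoldenCase("c2", "hand_authored", date(2023, 3, 5), date(2024, 5, 20), flips=2),
    GoldenCase("c3", "incident_postmortem", date(2023, 6, 1), date(2023, 9, 30)),
]
stale = overdue_for_review(cases, today=date(2024, 6, 30))
```

In this example `stale` contains `c1` and `c3`. The same records answer the other maintenance questions (which cases are hand-authored, which flip most often) with equally short filters.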
The deepest check, and the one almost nobody runs: periodically ask whether the golden set can still tell a good build from a bad one. The test is direct — take a build you know is worse (an older or deliberately degraded checkpoint) and one you know is better, and confirm the set scores them in the right order with a meaningful gap. If the known-bad build still passes, or the gap has compressed to noise, the set has lost discriminative power and no amount of fresh cases at the margin will restore it. This is the eval-data analogue of the judge meta-eval: that essay measures whether your judge can still grade; this measures whether your data can still separate. A suite that gates merges should be re-qualified to do its job, on a cadence, like any instrument that matters.
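The check itself is a few lines. In this sketch the function names and the `min_gap` default of five points are assumptions; a real threshold should sit comfortably above your suite's run-to-run noise.

```python
def pass_rate(results):
    """Fraction of cases passed; `results` maps case id -> bool."""
    return sum(results.values()) / len(results)


def still_discriminates(good_results, bad_results, min_gap=0.05):
    """Check that the known-better build beats the known-worse one by min_gap.

    Returns (ok, gap), where gap = pass_rate(good) - pass_rate(bad).
    """
    gap = pass_rate(good_results) - pass_rate(bad_results)
    return gap >= min_gap, gap


# Known-better build: fails one case out of twenty.
good = {f"c{i}": True for i in range(20)}
good["c0"] = False

# Known-worse (deliberately degraded) build: fails eight out of twenty.
bad = {f"c{i}": i >= 8 for i in range(20)}

ok, gap = still_discriminates(good, bad)

# A saturated set: both builds pass everything, so the gap collapses to zero.
saturated = {f"c{i}": True for i in range(20)}
ok_saturated, gap_saturated = still_discriminates(saturated, saturated)
```

The first pair is separated by 35 points, so the set still does its job. The saturated pair shows the failure mode: a set every build passes cannot rank anything.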
Versioning and re-review keep the set correct. Coverage is a separate axis: is the set still representative? A golden set can be perfectly labeled and entirely current and still measure the wrong thing, because its mix of cases no longer matches the mix of cases users actually send.
Audit it the way you would audit any sample for bias. Bucket production traffic by the dimensions that matter — feature area, query type, user segment, language, difficulty — and bucket the golden set the same way. Compare the histograms. The gaps are your blind spots: a feature that is 30% of traffic and 5% of the golden set is a feature your eval is mostly not testing, and a regression there will ship green. New buckets that exist in traffic and not in the set at all are pure blind spots. Coverage drift is quieter than stale facts — nothing in the set looks wrong — which is exactly why it needs a deliberate, scheduled comparison against the live distribution rather than a glance.
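A sketch of the histogram comparison, assuming each production query and each golden case has already been assigned a bucket label. The function name `coverage_gaps` and the `min_ratio` threshold are hypothetical: here a bucket is flagged when its share of the golden set is less than half its share of traffic.

```python
from collections import Counter


def coverage_gaps(traffic_buckets, golden_buckets, min_ratio=0.5):
    """Flag buckets that are under-represented in the golden set.

    Returns {bucket: (traffic_share, golden_share)} for every traffic bucket
    whose golden share is below min_ratio * its traffic share. Buckets absent
    from the golden set have a golden share of 0.0 and are always flagged.
    """
    traffic = Counter(traffic_buckets)
    golden = Counter(golden_buckets)
    n_traffic = sum(traffic.values())
    n_golden = sum(golden.values())

    gaps = {}
    for bucket, count in traffic.items():
        t_share = count / n_traffic
        g_share = golden[bucket] / n_golden if n_golden else 0.0
        if g_share < min_ratio * t_share:
            gaps[bucket] = (t_share, g_share)
    return gaps


traffic = ["search"] * 50 + ["billing"] * 30 + ["export"] * 15 + ["voice"] * 5
golden = ["search"] * 70 + ["billing"] * 5 + ["export"] * 25
gaps = coverage_gaps(traffic, golden)
```

With these made-up numbers the audit flags `billing` (30% of traffic, 5% of the set) and `voice` (present in traffic, absent from the set). Run the same comparison once per dimension: feature area, query type, segment, language, difficulty.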
This is also the line between this post and its sibling. Coverage and freshness are properties of the eval data — this post’s subject. Whether the model grading that data is itself biased or miscalibrated is a property of the eval judge, and that is the LLM-as-judge essay. A trustworthy eval needs both. A rotting set with a perfect judge and a perfect set with a drifting judge fail the same way — a confident number nobody should trust.
A golden set is maintained, or it is decaying. There is no third state.
A golden set is a sample of a moving target, and a sample goes stale. Maintain it on a cadence, or accept that your pass rate is a number from last year wearing this year’s date.
