Your golden set is rotting.

The eval suite has read 95% for two quarters. It runs in CI, it gates every merge, and nobody has touched it because nothing has gone wrong with it. Meanwhile the support queue is filling with the same complaint in different words: the product used to handle this, and now it does not. The dashboard says the system is healthy. The users say it regressed. Both are reporting honestly.

The instinct is to trust the dashboard — it has hard numbers, the users have anecdotes. That instinct is the bug. The 95% is also an anecdote; it is the answer to a question the golden set asked a year ago and has been asking, unchanged, ever since. The world the set was built to probe has moved, and the pass rate did not fall because the set lost the ability to detect failure — the way a thermometer reads room temperature accurately while the building burns down around it.

A golden evaluation set gets treated as a fixed asset — built once, then trusted forever, like a unit test that either passes or does not. It is not that kind of asset. It is a sample of a moving distribution, and a sample goes stale. The facts in it date, the product drifts away from it, the team grinds it down, its labels accumulate errors, and its distribution slides away from the traffic it was meant to mirror. A golden set without a maintenance protocol does not hold its value. It rots, on its own schedule, while showing you a number that says otherwise. This post is the protocol.

Five ways a golden set rots

“The set decayed” is not one failure. It is five, with five different causes and five different fixes. Naming them separately is the whole point, because a team that sees only “the number is stale” reaches for one fix and misses four.

The facts go stale. A golden answer is correct on the day it is written and not one day longer than the world it describes. “The current rate limit is 1,000 requests per minute”; “the supported regions are X, Y, Z”; “the latest model is N.” The product ships a change, the golden answer does not, and now the set marks the right new answer as wrong and rewards a model that recites the old one. This is the rot that bites hardest in retrieval systems, where the corpus moves constantly — the faithfulness-versus-groundedness essay named the live version of this failure: an answer perfectly grounded in a stale source. A stale golden answer is that same stale source, baked into your eval.
The product’s scope drifts away from the set. The golden set was drawn when the product did three things. It now does seven, and two of the original three were deprecated. The set still spends its questions on the old shape of the product. It over-tests retired features and barely touches the new surface area — so the eval is most confident about exactly the parts that matter least.
The team overfits to it. This is the subtlest and most corrosive, because it is caused by the set working. Every case the golden set catches gets fixed — that is the point of having it. But the fixes accumulate against this specific set, and over time the only cases left are the ones the system already handles. Engineers glance at failing items and tune for them. The pass rate climbs toward 100% not because the system got broadly better but because the set got memorized. This is the exact mechanism behind benchmark overfitting in the literature, and it has been measured.
Label errors accumulate. Every golden answer was written or reviewed by a human, and humans err. A misread requirement, a defensible-but-wrong call on an ambiguous case, an answer that was right before a spec changed. Each wrong label is a permanent trap: it fails correct outputs and passes incorrect ones, and because it is in the golden set nobody questions it — the set is the thing other things get checked against. The errors are load-bearing and invisible.
The distribution drifts from production. The set was a fair sample of real traffic on the day it was assembled. Traffic then moved — new user segment, a feature that changed how people phrase things, a seasonal shift, a workflow nobody predicted. The set is now a fair sample of a distribution that no longer exists. It tests the questions users used to ask.

Five failures, one symptom. The pass rate is high, holds steady, and tells you nothing — because a number can only mean something if the thing producing it has been kept honest.

Label errors are not a rounding error

One of those five deserves its own paragraph, because teams assume their golden labels are clean and the published evidence says otherwise.

The reference result is Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks (Northcutt et al., 2021). The authors went through ten of the most-cited ML benchmarks — datasets thousands of papers had treated as ground truth — and found an average of about 3.4% label errors. Not in obscure sets; in the canonical ones. And the consequence was not cosmetic: on corrected labels, benchmark rankings changed — models ranked behind others moved ahead once the mislabeled examples were fixed. The leaderboard had partly been measuring noise.

Carry that to your golden set. It is smaller, more domain-specific, and labeled with less ceremony than ImageNet. There is no reason to assume it beats 3.4%, and several reasons — ambiguous domain calls, specs that shifted after labeling, less reviewer redundancy — to assume worse. A few percent wrong labels means a few percent of your pass rate is pure noise, sitting in fixed locations, quietly failing your good builds and passing your bad ones. Label review is not housekeeping. It is the part of maintenance that determines whether the other parts measure anything.

The symptoms — how to know it is happening to you

The rot is gradual and the number is reassuring, so you need symptoms that name the disease before a customer does.

High pass rate, rising complaints. The scenario at the top. The eval and the users disagree, and the eval is losing. When real-world signal contradicts the dashboard, the dashboard is the stale one — production is, by definition, current.
The set stops catching regressions. A bug ships, users find it, you trace and fix it — and then notice the golden set was green through the entire incident. A set that passes a build users would reject has lost discriminative power. Every escaped regression is the set telling you it has gone blind to a region of behavior.
The score is pinned near the top and flat. A pass rate that has sat at 94–96% for several quarters is not evidence of a stable, excellent system. It is the signature of a set that has been overfit — every case it can catch is long since fixed, so it has no movement left to report. A healthy eval set, refreshed against a changing product, should occasionally dip as it surfaces something new.
Failures cluster on stale facts. When you do read the failing items, they are not interesting model errors — they are the model giving the new correct answer while the golden answer still holds the old one. That is not the system failing the eval. That is the eval failing the system.

If you recognize two of these, the set is rotting. The number is not the question; whether the number still discriminates is.

A maintenance protocol

A golden set holds its value the way any sampled dataset does: someone owns it, and it is refreshed on a cycle. The protocol below is the standing process — not a one-time cleanup, a cadence.

Version the set

The golden set is an artifact with a history, so give it one. Each version is tagged, dated, and accompanied by a changelog: cases added, retired, relabeled, and why. The payoff is comparison that means something — when a pass rate moves, you can tell whether the build changed or the set changed. Eval results reported against an unversioned, mutable set are uninterpretable: two runs were never graded against the same instrument.

Sample fresh cases from production traces

This is the engine of the whole protocol. The chunking essay makes the prior point — build a graded question set before tuning anything; maintenance is that argument extended through time: keep building it. On a regular cycle, pull real queries from production traces, especially ones the current system handled poorly, label them, and fold them in. Fresh cases pull the set’s distribution back toward live traffic, and they are the cases the team has not had a chance to overfit to yet. A set that only ever grows from the imagination of its authors drifts away from reality at exactly the rate reality changes. (Mining traces for eval cases is one more reason the trace pipeline the agent-observability essay argues for is not optional — you cannot sample from production you did not record.)

Retire stale and solved items

Maintenance removes as well as adds. Two kinds of case should leave the set. Stale ones — the golden answer describes a world that no longer exists; re-label or delete. Solved ones — cases the system has passed every run for a long stretch. A permanently-green case contributes no information; it cannot move, so it cannot tell you anything. Retiring it (or moving it to a cheap, separate smoke-test tier) keeps the golden set concentrated on behavior that is still live and still uncertain. A set that only grows becomes slow and dilutes its own signal under a rising tide of foregone-conclusion passes.

Re-review labels on a cycle

Given the 3.4% result, periodic re-review is not optional. On a schedule, sample the existing golden answers and have a human — ideally not the original author — re-judge them against the current spec and the current world. Pay special attention to cases that disagree with a build you have independent reason to trust: when a model you believe is correct fails a golden case, the golden label is a prime suspect, not an afterthought. Every correction is a changelog entry.

Track provenance and last-reviewed date per case

Every case carries metadata: where it came from (production trace, hand-authored, incident postmortem), when it was added, when its label was last reviewed, and how often it has flipped. This is what makes the rest of the protocol mechanical instead of heroic. “Show me every case not reviewed in nine months” becomes a query, not an archaeology project. A case with no provenance is a case you cannot maintain, because you cannot tell whether it is stale — exactly the way an observable agent is only as auditable as the metadata trail it leaves.

Measure the set’s discriminative power

The deepest check, and the one almost nobody runs: periodically ask whether the golden set can still tell a good build from a bad one. The test is direct — take a build you know is worse (an older or deliberately degraded checkpoint) and one you know is better, and confirm the set scores them in the right order with a meaningful gap. If the known-bad build still passes, or the gap has compressed to noise, the set has lost discriminative power and no amount of fresh cases at the margin will restore it. This is the eval-data analogue of the judge meta-eval: that essay measures whether your judge can still grade; this measures whether your data can still separate. A suite that gates merges should be re-qualified to do its job, on a cadence, like any instrument that matters.

Coverage — does the set still mirror production

Versioning and re-review keep the set correct. Coverage is a separate axis: is the set still representative. A golden set can be perfectly labeled and entirely current and still measure the wrong thing, because its mix of cases no longer matches the mix of cases users actually send.

Audit it the way you would audit any sample for bias. Bucket production traffic by the dimensions that matter — feature area, query type, user segment, language, difficulty — and bucket the golden set the same way. Compare the histograms. The gaps are your blind spots: a feature that is 30% of traffic and 5% of the golden set is a feature your eval is mostly not testing, and a regression there will ship green. New buckets that exist in traffic and not in the set at all are pure blind spots. Coverage drift is quieter than stale facts — nothing in the set looks wrong — which is exactly why it needs a deliberate, scheduled comparison against the live distribution rather than a glance.

This is also the line between this post and its sibling. Coverage and freshness are properties of the eval data — this post’s subject. Whether the model grading that data is itself biased or miscalibrated is a property of the eval judge, and that is the LLM-as-judge essay. A trustworthy eval needs both. A rotting set with a perfect judge and a perfect set with a drifting judge fail the same way — a confident number nobody should trust.

The checklist

A golden set is maintained, or it is decaying. There is no third state.

The golden set is versioned — tagged, dated, with a changelog of every add, retirement, and label correction.
Fresh cases are sampled from production traces on a regular cycle, weighted toward what the current system handles poorly.
Stale cases (golden answer describes a world that no longer exists) are re-labeled or retired.
Permanently-solved cases are retired or moved to a separate smoke-test tier so they stop diluting the signal.
Labels are re-reviewed on a schedule by someone other than the original author; corrections are logged.
Every case carries provenance and a last-reviewed date, so “what is stale” is a query.
The set’s discriminative power is re-checked on a cadence — it still scores a known-bad build below a known-good one, with a real gap.
Coverage is audited against production: the set’s distribution by feature, query type, and segment is compared to live traffic, and the gaps are known.
Eval results are reported against a specific golden-set version, never an unversioned mutable set.

Reading list

Northcutt et al., Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks — the ~3.4%-label-error result across ten canonical benchmarks, and the proof that fixing labels reorders leaderboards.
A Careful Examination of Large Language Model Performance on Grade School Arithmetic (Zhang et al., 2024) — the GSM1k study; built a fresh held-out set and measured accuracy drops of up to ~13%, the cleanest demonstration of benchmark overfitting.
Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation — a 2025 survey of why static benchmarks decay and what continuously-refreshed evaluation looks like.

A golden set is a sample of a moving target, and a sample goes stale. Maintain it on a cadence, or accept that your pass rate is a number from last year wearing this year’s date.