Eval-driven development: write the eval before the feature.

A team is asked to build one feature: the agent should decline to answer when the retrieved context does not support an answer. Two engineers read that sentence. One builds an agent that declines when retrieval returns nothing at all. The other builds one that declines when retrieval returns passages but none of them actually addresses the question. Both ship. Both pass code review. Both demo cleanly on the handful of examples their authors happened to try. They are different products — they diverge on a large, common class of inputs — and nobody has noticed, because the sentence that specified the feature was not precise enough to build from, and nothing downstream of it forced the imprecision into the open.

The missing artifact is an eval. Not the eval most teams mean — a regression check written the week after launch, which can only ratify whatever the feature already happens to do — but an eval written first: a graded set of context-and-question pairs, each one labeled, before any code, with the verdict the feature is supposed to produce. You cannot assemble that set without deciding what “supported” means, because you cannot label a single borderline case without deciding it. The case where retrieval returned three passages that mention the right product but never state the number the question asks for is not a case the prose requirement adjudicates — but the eval has to give it a verdict, pass or fail, written down. Writing the eval is the act of finishing the specification.

This essay is about that inversion. Eval-driven development is not “add tests to your LLM app.” It is the claim that for anything built on a model, the graded eval set is the specification — the most precise statement of intended behavior the team will ever produce — and that authoring it before the feature, rather than after, is what separates a spec you can build from a sentence you can argue about.

“Eval” has come to mean the wrong thing

For most teams the word “eval” names a regression suite: a set of cases, run in CI, that catches a score dropping between one version and the next. That is a real and useful thing, and nothing here argues against it. But notice what it cannot do. A regression suite written after the feature shipped encodes the behavior the feature already has: it samples the deployed system’s outputs and freezes them as the reference. If the feature shipped with the wrong reading of “supported,” the eval faithfully locks in the wrong reading and then defends it: every future pull request that nudges the behavior toward the other reading now fails the suite and looks like a regression. The artifact meant to catch mistakes is actively protecting one. That is the structural problem with an eval that comes last. It ratifies. It does not specify, and worse, it cannot tell the difference between a regression and a correction.
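
The ratification trap is easy to reproduce in miniature. The sketch below is illustrative only: the three inputs, the two toy systems, and the choice of which reading the team intended are all hypothetical, made up to show the mechanism rather than taken from any real suite.

```python
# Hypothetical inputs: (question, number_of_passages, passages_state_the_answer)
INPUTS = [
    ("rate limit?", 0, False),  # nothing retrieved
    ("rate limit?", 2, True),   # passages state the answer
    ("rate limit?", 3, False),  # topical passages, no answer in them
]

def shipped(case):
    """Toy system with the first reading: decline only on empty retrieval."""
    _, n_passages, _ = case
    return "decline" if n_passages == 0 else "answer"

def corrected(case):
    """Toy system with the second reading: decline when no passage answers."""
    _, _, states_answer = case
    return "answer" if states_answer else "decline"

def snapshot(system, inputs):
    """A post-launch 'eval': freeze whatever the system outputs today."""
    return {case: system(case) for case in inputs}

def failing(system, reference):
    """Cases where the system's output differs from the reference verdict."""
    return [case for case, want in reference.items() if system(case) != want]

# Reference frozen from the deployed system after launch.
FROZEN = snapshot(shipped, INPUTS)

# Reference labeled by the team before any code (assumes the second reading
# is the intended one; that choice is an assumption of this sketch).
SPEC = {INPUTS[0]: "decline", INPUTS[1]: "answer", INPUTS[2]: "decline"}

print("shipped vs frozen:  ", failing(shipped, FROZEN))
print("corrected vs frozen:", failing(corrected, FROZEN))
print("shipped vs spec:    ", failing(shipped, SPEC))
```

Against the frozen reference, the shipped system is green and the corrected one is red on the borderline input. Against the labels written first, the verdicts flip.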

Eval-driven development is the other order. The graded set comes first, and the feature is built to pass it. Both major model labs now describe this explicitly. Anthropic’s guidance on agent evals frames it as building “evals to define planned capabilities before agents can fulfill them” — capabilities the agent does not have yet, defined by the eval that will say when it does. OpenAI’s account of eval-driven system design is blunter about the failure it replaces: making evals the core process “prevents poke-and-hope guesswork and impressionistic judgments of accuracy, instead demanding engineering rigor.” Poke-and-hope is the default whenever the eval comes last — a developer changes a prompt, runs three queries by hand, decides it looks better, and ships. There is no scoreboard, so “better” is whatever the person who made the change feels after a small sample they themselves chose. The eval coming first is what replaces that feeling with a number.

Where the specification stops being ambiguous

Return to the two engineers. The reason they built different products is not carelessness; it is that the requirement contained an undetected ambiguity, and prose review does not surface that kind of ambiguity. A reviewer reads “decline when the context does not support an answer,” nods, and moves on, because the sentence is grammatical and reasonable and matches the reviewer’s own mental model — which is, of course, just one of the two readings. Prose review confirms that a requirement is sayable. It does not confirm that it is decidable, and an ambiguous requirement is perfectly sayable.

A graded eval is the forcing function that prose review is not. To label thirty examples you must, thirty times, decide whether this context supports this answer. Twenty-five of those will be obvious and both engineers would agree. The remaining five — retrieval returned something topical but thin, a passage that is adjacent to the answer without containing it — are precisely the cases where the two readings diverge, and the eval does not let you skip them. It demands a label. Producing that label is the team sitting down and deciding the question the requirement left open, on concrete inputs, with the answer recorded. Anthropic makes this the explicit argument for writing evals early: defining eval tasks “is one of the best ways to stress-test whether the product requirements are concrete enough to start building.” If you cannot label the set, the feature is not specified yet, and the eval just told you so — before the code was written, instead of after it was deployed two different ways and a customer hit the seam.
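
A minimal sketch of what that labeling produces, assuming a decline/answer feature like the one above. The `EvalCase` type, the example passages, and the digit-based stand-in for "a passage states the answer" are all hypothetical; a real set would be drawn from real transcripts and labeled by people.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    question: str
    passages: tuple[str, ...]  # retrieved context shown to the agent
    should_decline: bool       # verdict agreed on before any feature code

# Hypothetical cases, for illustration only.
CASES = [
    EvalCase("What is the Pro plan's API rate limit?", (), True),
    EvalCase(
        "What is the Pro plan's API rate limit?",
        ("The Pro plan allows 500 API requests per minute.",),
        False,
    ),
    # Borderline: topical passages that never state the number.
    EvalCase(
        "What is the Pro plan's API rate limit?",
        ("The Pro plan includes API access.", "Rate limits vary by plan."),
        True,
    ),
]

def reading_a(case):
    """First engineer's reading: decline only when retrieval is empty."""
    return len(case.passages) == 0

def reading_b(case):
    """Second engineer's reading: decline when no passage states an answer.

    Crude stand-in for this one question: an answer must contain a digit.
    """
    return not any(any(ch.isdigit() for ch in p) for p in case.passages)

def disagreements(cases):
    """The cases a prose requirement leaves open: the two readings differ."""
    return [c for c in cases if reading_a(c) != reading_b(c)]

for case in disagreements(CASES):
    print("needs a decision:", case.passages, "-> labeled", case.should_decline)
```

The two readings agree on the first two cases and split on the third, which is the only one where the recorded label carries new information.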

What a good eval task looks like

An eval is only a specification if its verdicts are not themselves ambiguous. An eval whose own labels are contestable has merely moved the argument, not ended it. Three properties carry most of that weight.

Inter-annotator agreement. A good task is one where two people who understand the domain would independently reach the same pass/fail verdict. If they would not, the task has the same disease as the original requirement — it is underspecified — and the fix is to sharpen the task definition until they would agree. That sharpening is the spec work; the disagreement is the requirement’s ambiguity made visible on a specific input, which is exactly what you wanted.
Drawn from real failures, not invented. Anthropic’s starting recommendation is 20–50 tasks pulled from actual failure sources — bug trackers, support queues, transcripts of real use. Invented tasks encode what you imagine users do; real ones encode what they actually do, and the gap between those two is where most production failures live. The borderline cases that matter are the ones users actually generate, not the ones an engineer thinks to type.
Balanced across the decision. Include the cases where the behavior should fire and the cases where it should not. An eval that contains only “should decline” examples is passed perfectly by an agent that declines everything — including every question it should have answered. Class imbalance silently rewards a degenerate solution, and the degenerate solution will score high enough to look done.
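
Two of these properties can be checked mechanically before any feature code exists. The sketch below uses Cohen's kappa as the agreement measure, which is one standard choice and not something the guidance cited here prescribes; the annotator labels and case ids are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled at random
    # with their own observed label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def contested(case_ids, labels_a, labels_b):
    """Cases whose task definition still needs sharpening."""
    return [cid for cid, a, b in zip(case_ids, labels_a, labels_b) if a != b]

def positive_fraction(labels):
    """Share of 'should decline' cases; near 0 or 1 means the set is lopsided."""
    return sum(labels) / len(labels)

# Hypothetical labels: True means "should decline".
ids = ["c1", "c2", "c3", "c4", "c5", "c6"]
annotator_1 = [True, True, False, False, True, False]
annotator_2 = [True, True, False, True, True, False]

print("kappa:", round(cohens_kappa(annotator_1, annotator_2), 3))
print("contested:", contested(ids, annotator_1, annotator_2))
print("decline share:", positive_fraction(annotator_1))
```

The contested list matters more than the kappa value: each id in it is a concrete input on which the requirement is still ambiguous.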

Grade with the cheapest method that works

A specification that is expensive to check gets checked rarely, and a spec checked rarely stops being a spec: it becomes a document the team consults once and then drifts away from. Anthropic’s guidance lists three grading methods: code-based grading (exact match, a regex, a numeric comparison against a known answer), which is the fastest and most reliable; human grading, which is the most flexible but slow and expensive; and LLM-as-judge, which scales like code but has to be validated before it is trusted. Prefer the cheapest method the task allows, and “prioritize volume over quality”: a thousand questions graded automatically is a stronger specification than fifty graded by hand, because it covers more of the input space and, decisively, it runs on every commit without a person in the loop. An eval that needs a human to convene is an eval that runs monthly; an eval that runs in CI is one that runs forty times a day.
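
Code-based grading is usually a few lines per grader. The sketch below shows the three kinds named above; the function names, the normalization choices, and the example strings are assumptions made for illustration, and a real task would pick its own.

```python
import re

def grade_exact(output, expected):
    """Exact match after trimming whitespace and ignoring case."""
    return output.strip().lower() == expected.strip().lower()

def grade_regex(output, pattern):
    """Pass if the pattern appears anywhere in the output."""
    return re.search(pattern, output) is not None

def grade_numeric(output, expected, tol=1e-6):
    """Pass if the first number in the output is within tol of expected.

    Thousands separators are stripped first. Taking the first number is a
    simplification; outputs that contain several numbers need a stricter format.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return match is not None and abs(float(match.group()) - expected) <= tol

print(grade_exact(" Paris ", "paris"))
print(grade_regex("I cannot answer from the provided context.",
                  r"(?i)\bcannot answer\b"))
print(grade_numeric("The limit is 1,500 requests.", 1500))
```

Graders like these cost nothing to run, so the same set can be rerun on every commit.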

LLM-as-judge earns its place when the output is open-ended enough that code cannot grade it — a free-text answer where there is no single string to match. It is also more trustworthy than its reputation: the MT-Bench work (arXiv 2306.05685) measured strong LLM judges agreeing with human preferences over 80% of the time, the same rate at which humans agree with each other. But a judge is itself a model, with position, verbosity, and self-preference biases — it can prefer the first answer shown, the longer answer, or an answer from its own model family — and a judge is a thing you have to evaluate and maintain in turn, which means it needs its own small graded set of human-labeled verdicts. The failure modes are the subject of the LLM-as-judge essay. Use it; do not trust it unaudited.
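
Auditing a judge can start small. The sketch below assumes a hypothetical `judge(question, first, second)` callable that returns `"first"` or `"second"`; the two stub judges are toys standing in for a real model call. It checks agreement against human verdicts, then checks whether a verdict survives swapping the order of the two answers, a common test for position bias.

```python
def judge_agreement(judge_verdicts, human_verdicts):
    """Fraction of cases where the judge matches the human label."""
    if len(judge_verdicts) != len(human_verdicts) or not human_verdicts:
        raise ValueError("need two equal-length, non-empty verdict lists")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)

def position_consistent(judge, question, answer_1, answer_2):
    """True if the same underlying answer wins in both presentation orders."""
    forward = judge(question, answer_1, answer_2)
    swapped = judge(question, answer_2, answer_1)
    # answer_1 wins forward as "first" and must win swapped as "second".
    return (forward == "first") == (swapped == "second")

# Toy stand-ins for a real judge model (hypothetical).
def always_first(question, first, second):
    return "first"  # pure position bias

def prefers_longer(question, first, second):
    return "first" if len(first) >= len(second) else "second"

q = "What is the Pro plan's rate limit?"
short, long_ = "500/min.", "The Pro plan allows 500 requests per minute."

print(position_consistent(always_first, q, short, long_))
print(position_consistent(prefers_longer, q, short, long_))
print(judge_agreement([True, False, True, True], [True, False, False, True]))
```

The second stub passes the swap test while still being biased toward length, which is why the human-labeled set is needed alongside it.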

One more property: the success criterion attached to the eval has to be a number. “Make the agent better at declining” is not a spec — it has no point at which it is satisfied. “F1 of at least 0.85 on the held-out refusal set” is — specific, measurable, falsifiable — and it tells you when the feature is done and, equally, when a change has broken it. The eval is the spec; the threshold is the definition of done.
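
A minimal sketch of that criterion, treating "should decline" as the positive class. The 0.85 threshold is the example figure from the paragraph above, and the verdict lists are hypothetical.

```python
THRESHOLD = 0.85  # example threshold; the real one is a team decision

def f1_score(predicted, expected):
    """F1 for the positive class; here positive means 'should decline'."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)
    fp = sum(1 for p, e in pairs if p and not e)
    fn = sum(1 for p, e in pairs if not p and e)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def is_done(predicted, expected, threshold=THRESHOLD):
    """The definition of done: the score meets the agreed threshold."""
    return f1_score(predicted, expected) >= threshold

# Hypothetical balanced set: four should-decline cases, four should-answer.
expected = [True, True, True, True, False, False, False, False]
candidate = [True, True, True, False, True, False, False, False]
decline_everything = [True] * 8

print(round(f1_score(candidate, expected), 3), is_done(candidate, expected))
print(round(f1_score(decline_everything, expected), 3))
```

On a balanced set, declining everything scores about 0.667 and fails the threshold. On a set containing only should-decline cases, the same degenerate agent would score 1.0.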

The eval becomes the gate

Once the graded set exists and carries a threshold, it stops being a document and becomes a control. Wire it into CI so a change that drops the score below threshold blocks the merge — the same gate the faithfulness-and-groundedness essay argues for on RAG quality.

   eval-as-spec authored first
   20–50 labeled cases + numeric threshold
              │
              ▼
   ┌──────────────────────┐      below threshold
   │  feature built to    │  ──────────────────────►  fix, re-run
   │  pass the eval       │
   └──────────┬───────────┘
              │ meets threshold
              ▼
   ┌──────────────────────┐
   │  CI merge gate       │ ◄── every prompt edit, model swap,
   │  runs the eval set   │     retrieval change re-enters here
   └──────────┬───────────┘
              │ pass                    fail
              ▼                          │
          merge ──► production           ▼
              │                  blocked at the PR,
              │                  not found in prod
              ▼
   new failure observed ──► added as a labeled case ──► eval grows

Now the specification is enforced continuously rather than consulted occasionally: every prompt edit, model swap, and retrieval change is measured against the behavior the team agreed on, and a regression surfaces at the pull request — where it is one red check and a five-minute fix — instead of in production, where it is a customer report and an incident. OpenAI calls the resulting loop an evaluation flywheel: the eval catches a gap, the gap motivates a fix, the fix is verified by the eval, and the eval grows as new failures are found and folded back in as fresh labeled cases. The flywheel only turns if the eval existed before the thing it gates — a gate installed after launch is measuring against the launched behavior, which is the ratification trap again.
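
The gate itself can be very little code. The sketch below assumes the eval run produces a list of per-case results; the result shape, the function names, and the use of pass rate as the score are hypothetical choices for illustration, not a description of any particular CI system.

```python
THRESHOLD = 0.85  # hypothetical definition of done, set with the eval

def pass_rate(results):
    """results: list of dicts like {"id": "case-07", "passed": True}."""
    if not results:
        raise ValueError("empty eval run; refusing to report a score")
    return sum(r["passed"] for r in results) / len(results)

def gate_exit_code(results, threshold=THRESHOLD):
    """Return 0 to allow the merge, 1 to block it."""
    score = pass_rate(results)
    failing = [r["id"] for r in results if not r["passed"]]
    print(f"eval pass rate {score:.3f}, threshold {threshold:.2f}, "
          f"failing: {failing}")
    return 0 if score >= threshold else 1

def fold_in(cases, new_case):
    """Flywheel step: a failure seen in production becomes a labeled case."""
    if any(c["id"] == new_case["id"] for c in cases):
        raise ValueError(f"duplicate case id: {new_case['id']}")
    return cases + [new_case]

# Hypothetical run: 20 cases, 2 failing.
run = [{"id": f"case-{i:02d}", "passed": i >= 2} for i in range(20)]
print("exit code:", gate_exit_code(run))
```

A CI step would pass the return value to `sys.exit`, so a score below threshold fails the check at the pull request. Raising on an empty run is deliberate: a gate that silently passes when no cases ran is not a gate.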

The honest limits

Eval-driven development is a strong default, not a universal law, and the case for it is weaker if oversold.

Not every requirement reduces to a graded set. Some properties — overall tone, the feel of a long multi-turn interaction, aesthetic quality — resist crisp pass/fail labeling, and forcing them into an eval produces a bad eval that specifies the wrong thing precisely: a rubric that scores tone on a five-point scale will optimize the rubric, not the tone. For those properties, the eval is one input among several — alongside human review and judgment — not the whole spec.

And the strict claim — that writing the eval before the feature beats writing it during — is not something the literature has cleanly isolated; even Anthropic’s guidance allows that some teams build the eval alongside the work, and a team that writes eval and feature in tight alternation is not making the mistake this essay warns about. The defensible claim is the one the essay actually rests on: the eval must exist early, and certainly before anyone calls the feature done, because its job is to finish the specification, and a spec that arrives after the build cannot do that job. Finally, an eval is not a fixed asset — golden answers go stale as the product and the world move, a labeled “correct” answer can quietly become wrong, and a rotting eval lies about the pass rate while looking green. The eval needs maintenance like any other part of the system.

The recommendation

For any feature built on a model:

Before writing feature code, write the graded eval set — 20–50 real cases, labeled with intended verdicts.
If you cannot label a case, treat that as an unfinished specification and resolve it now, in the requirement.
Confirm two domain experts would agree on every label; balance cases across the decision boundary.
Attach a numeric threshold — that number is the definition of done.
Grade with the cheapest reliable method; reserve LLM-as-judge for what code cannot grade, and audit the judge.
Wire the eval into CI as a merge gate before the feature ships, not after.

The question that decides whether a feature is specified is simple: is there a graded set, written down, that says what it should do? If there is not, the feature is not behind schedule for lacking an eval — it is unspecified, and every engineer who touches it is quietly building a different product.

Reading list

Anthropic — demystifying evals for AI agents; the explicit case for defining evals before the capability exists: anthropic.com
OpenAI — eval-driven system design; evals as the core process that replaces poke-and-hope: developers.openai.com
Anthropic — define success criteria and build evaluations; the grading hierarchy and measurable criteria: platform.claude.com
Judging LLM-as-a-Judge with MT-Bench — the >80% judge–human agreement figure, and the judge’s biases: arXiv 2306.05685
A Survey on LLM-as-a-Judge — building judges you can rely on: arXiv 2411.15594
Ragas — reference-free evaluation of RAG pipelines, when the feature is retrieval: arXiv 2309.15217

A feature without an eval is not a feature with a missing test. It is a sentence — and every engineer who reads it is free to build something else.
