# LLM-as-judge is a model you also have to evaluate.

A team builds an eval harness the right way. A golden set of questions, a known-good answer for each, a regression gate in CI: a build that scores below threshold does not merge. The grading is automated — an LLM reads each answer and the reference, and returns a score from one to ten. The harness runs on every pull request. The dashboard is green. Releases ship on its say-so.

Now name the one component in that loop that nobody evaluated. The retriever was measured. The generator was measured. The prompts were measured. The judge — the model whose number decides whether a release ships — was wired in and trusted. Its scores are treated as ground truth, the fixed point everything else is measured against. It was never itself put on a bench.

That is the gap this post is about. An LLM-as-judge is not an oracle. It is a model doing a hard task — reading two pieces of text and forming a preference — and it brings to that task the same failure modes every other model has: documented biases, shaky calibration, and a tendency to drift when the vendor updates it under you. A judge you have not evaluated is an unmeasured instrument, and an unmeasured instrument cannot tell you whether your system got better. This post is about evaluating the judge before you let it gate anything.

## The judge is a model, so it fails like one

Teams reach for an LLM judge because the alternatives do not work: a human grading every output on every pull request does not scale, and exact-match scoring cannot grade an open-ended answer. Both reasons are sound; the judge is genuinely useful. The error is what happens next — the score comes back, and it is treated as fact.

It is not fact. It is a second model's output, produced by the same next-token machinery as the thing being judged, and it inherits the same properties: it is sensitive to phrasing, to order, to length; it has preferences unrelated to answer quality; it is, like every model, a little overconfident. None of this makes it useless — a noisy instrument is still an instrument — but it means the judge's output is a measurement with an error bar, and a measurement with an unknown error bar cannot gate a deploy.

The foundational study here is still [_Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena_](https://arxiv.org/abs/2306.05685) (Zheng et al., 2023) — and it is the study most often cited for the wrong reason. Its headline result, that a strong judge reaches roughly 85% agreement with human experts (higher than the 81% humans reached with each other), gets quoted as a clean bill of health. Read the rest of the paper: the same authors spend several sections cataloguing how their judge is biased and miscalibrated. The agreement number and the bias catalogue are the same finding — the judge is good enough to use and flawed enough that you have to measure it. Treat those as one sentence.

## The biases are documented, not hypothetical

The judge's preferences are not folklore. They have been quantified in peer-reviewed work, specific enough to test for. The catalogue worth knowing, with what the literature actually found:

| Bias                   | What the judge does                                                                          | Evidence                                                                                                                                                                        |
| ---------------------- | -------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Position bias          | Favors an answer for being first (sometimes last), not for being better                      | MT-Bench: only GPT-4 stayed consistent across a swap in more than 60% of cases; Claude-v1 agreed with itself just 23.8% of the time and picked the first answer 75% of the time |
| Verbosity / length     | Prefers the longer answer, quality held constant                                             | MT-Bench's "repetitive list" attack — pad an answer with restated items, no new information — fooled GPT-3.5 and Claude-v1 91.3% of the time                                    |
| Self-preference        | Rates its own family's outputs above what humans rate them                                   | MT-Bench observed GPT-4 giving itself a ~10% higher win rate and Claude-v1 a ~25% higher win rate; later work tied this to self-recognition                                     |
| Sycophancy             | Rewards an answer that agrees with the prompt or flatters the user                           | A recurrent finding across the bias literature — judges shift scores toward agreement and confident phrasing over correctness                                                   |
| Formatting / authority | Swayed by surface cues (markdown, structure, a citation, an assertive tone) over substance   | The "Justice or Prejudice?" study isolates formatting, sentiment, and fake-citation effects and measures a robustness drop for each                                             |

Two cautions on reading that table. First, the numbers are anchored to specific models and test setups; they are evidence that the bias exists and can be large, not constants to plug into your own system. Your judge, on your task, has its own coefficients — which is the entire argument for measuring it yourself. Second, the MT-Bench authors were careful about self-preference: with limited data and small margins, they declined to claim it as proven. The stronger evidence came later. [_LLM Evaluators Recognize and Favor Their Own Generations_](https://arxiv.org/abs/2404.13076) (Panickssery et al., 2024) showed judges have non-trivial accuracy at recognizing their own text, and found a linear link between that self-recognition and the strength of self-preference. A broader sweep — [_Justice or Prejudice?_](https://arxiv.org/abs/2410.02736) (Ye et al., 2024) — quantified roughly a dozen distinct biases, position and verbosity among them, and found that even strong models fail meaningfully on several.

The practical reading: position and verbosity bias are robust and large; self-preference is real and worst when your judge and a system under test come from the same family; the surface-cue biases are real and smaller. All are measurable on your own harness, and none show up in a green dashboard.

## Calibration — does "7 out of 10" mean anything

Bias is one failure. Calibration is the other, and the quieter one. Ask a judge to rate an answer from one to ten and it returns a number. The question calibration asks: is that number a measurement, or a vibe rendered as a digit?

Mostly the latter. A pointwise score on an unanchored scale has no agreed unit. The judge's 7 and its 8 are not separated by a defined increment; the gap is not stable across questions, across runs, or across two answers of genuinely different quality. Judges also bunch their scores — they reach for 7 and 8 and rarely for 3 or 9 — so the scale you think has ten levels has, in practice, about three. A regression gate set at "mean score ≥ 7.5" is then gating on a number whose precision it never established. The score moved from 7.6 to 7.4; nobody can say whether the system got worse or the judge twitched.
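
One way to put a rough error bar on that 7.6-to-7.4 movement is a paired bootstrap over the golden set. The sketch below assumes you have the judge's score for the same items, in the same order, from two builds; the function names and defaults are illustrative, not from any library. It captures only the noise from which items happen to be in the set, so the judge's run-to-run variation comes on top of whatever interval it reports.

```python
import random
from statistics import mean


def paired_bootstrap_ci(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-item score change.

    `before` and `after` hold the judge's scores for the same golden-set
    items, in the same order, from two builds (an assumed data layout).
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty score lists of equal length")
    deltas = [a - b for a, b in zip(after, before)]
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    resampled_means = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples)
    )
    low = resampled_means[int((alpha / 2) * n_resamples)]
    high = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def change_is_distinguishable(before, after, **kwargs):
    """True only if the interval for the mean change excludes zero."""
    low, high = paired_bootstrap_ci(before, after, **kwargs)
    return not (low <= 0 <= high)
```

If the interval straddles zero, the gate has no basis for calling the build worse; a threshold like "mean score ≥ 7.5" is only meaningful once the interval is narrower than the movements you want to catch.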

Pairwise comparison is the more reliable instrument, and the first practice to adopt. Instead of "rate this answer," ask "here are two answers: which is better, or are they tied?" A relative judgment is a much easier task than inventing an absolute score, and it is the format the MT-Bench work and most bias studies use. It does not escape position bias (that is what answer-order randomization, described below, is for), but it removes the fiction of a calibrated ten-point scale. Where you genuinely need an absolute score, anchor it: a rubric with concrete descriptions of what a 3, a 6, and a 9 look like, and worked examples in the prompt, so "7" is a definition the judge applies rather than a feeling it reports.

This is also the boundary with the [faithfulness-versus-groundedness essay](/blog/faithfulness-vs-groundedness/): that post argues against collapsing two distinct RAG questions into one "rate this 1–10" call. Seen from the judge's side, the unanchored pointwise score is both an information-losing merge and an uncalibrated measurement. Decompose what you ask, and prefer relative judgments — the two essays reach that conclusion from different directions.

## Drift — your baseline moves while you sleep

Suppose you have done the work: measured the biases, switched to pairwise, anchored the rubric. There is still a moving part you do not control.

Most teams call a hosted judge — an API behind a model name. That name is not a version. The vendor updates the model behind it on their schedule, with no obligation to tell you and no changelog you can diff. The judge that graded last quarter's baseline is, after a silent update, a different judge. Your eval scores can shift by a few points with not one line of your code or your golden set having changed. The instrument was recalibrated underneath the experiment.

This is corrosive in a way a visible regression is not. A visible regression you investigate. Judge drift produces a number that is wrong in a way that looks exactly like a number that is right — a small movement, no obvious cause — so it gets absorbed as noise. Worse, it breaks comparison across time: this month's 8.1 and last month's 8.3 came from two different measuring devices, so their difference means nothing and any trend line through them is an artifact. An eval suite whose baseline silently moves is not a regression gate. It is a random number generator with a tasteful mean.

The fix is boring and non-negotiable: **pin the judge to a specific, dated model version**, never a floating alias. Treat a judge-version bump the way you treat a dependency bump — a deliberate change, with the meta-eval below re-run before and after so you see exactly what moved. This is the discipline [red-teaming MCP servers](/blog/red-teaming-mcp-servers/) applies to a different unpinned dependency: a thing that writes into your system and changes under you without a version is a thing you have not actually pinned.
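
A pin is easier to keep when the harness refuses to start without one. The sketch below assumes a naming convention in which a pinned identifier ends in a date stamp (`YYYY-MM-DD` or `YYYYMMDD`); that convention, the function names, and the `judge-large-*` identifiers are all hypothetical, so adjust the pattern to whatever scheme your vendor actually uses.

```python
import re

# Assumption: a pinned model identifier ends in a date stamp, written as
# YYYY-MM-DD or YYYYMMDD. This is an illustrative convention, not a rule
# that every vendor follows.
_DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """True if the identifier ends in a date stamp rather than a floating alias."""
    return _DATED_SUFFIX.search(model_id) is not None


def require_pinned_judge(model_id: str) -> str:
    """Return the identifier unchanged, or fail loudly if it looks like an alias."""
    if not is_pinned(model_id):
        raise ValueError(
            f"judge model {model_id!r} is not pinned to a dated version"
        )
    return model_id
```

Calling `require_pinned_judge` where the harness reads its configuration turns a floating alias into a failed build instead of a silently moving baseline.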

## How to evaluate the judge

The judge is a model, so you evaluate it the way you evaluate a model: with a labeled test set and a metric. The test set here is a **meta-eval set** — a few hundred items, each an answer (or a pair) with a verdict assigned by a human you trust. This is a different artifact from the golden set the rest of the harness uses: the golden set measures the product, the meta-eval set measures the judge. With that set, four checks:

- **Agreement with human labels.** Measure how often the judge's verdict matches the human verdict. This is the single number that tells you whether the judge is fit to grade at all — the MT-Bench ~85% sits in this slot, and your judge gets its own.
- **An inter-rater statistic, not raw agreement.** Raw percent agreement flatters a judge on a skewed set — if 80% of answers are "good," a judge that always says "good" scores 80% while measuring nothing. Use **Cohen's kappa**, which corrects for chance agreement. The Landis–Koch convention reads 0.61–0.80 as "substantial" and 0.41–0.60 as only "moderate"; a judge below substantial agreement with your humans is not ready to gate a merge.
- **Calibration checks.** For pointwise scores, bin them and ask whether each bin tracks the human grade: does the set of answers the judge called "8" actually beat the set it called "6"? If the bins do not separate, the scale is decorative; move to pairwise.
- **Bias probes.** Build the adversarial cases directly. Submit the same pair in both orders and measure how often the verdict flips — position bias as a number. Pad one answer with the repetitive-list trick and check whether the score rises — verbosity bias as a number. Run these in CI; they are cheap and they catch the judge degrading after a version bump.
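
The first two checks fit in a few lines. This is a minimal sketch of raw agreement and Cohen's kappa, assuming the judge's verdicts and the human verdicts arrive as two equal-length lists of labels; the function names are illustrative.

```python
from collections import Counter


def raw_agreement(judge, human):
    """Fraction of items where the judge's label equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    n = len(judge)
    p_o = raw_agreement(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    # Agreement expected if both raters labeled at random with their own
    # observed label frequencies.
    p_e = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 0.0 is a choice made here: the set carries no information.
        return 0.0
    return (p_o - p_e) / (1 - p_e)
```

On the skewed set described above (eight "good" answers, two "bad", and a judge that always says "good"), `raw_agreement` returns 0.8 and `cohens_kappa` returns 0.0, which is the gap between the two statistics in one example.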

This meta-eval does not run once. It runs on a schedule and on every judge-version change, because the thing it measures does not hold still.
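
The calibration check in the list above can be sketched the same way. The code assumes each meta-eval item is a `(judge_score, human_grade)` pair with a numeric human grade; that layout and the function names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_human_grade_by_judge_score(pairs):
    """Map each judge score to the mean human grade of the answers that got it."""
    bins = defaultdict(list)
    for judge_score, human_grade in pairs:
        bins[judge_score].append(human_grade)
    return {score: mean(grades) for score, grades in sorted(bins.items())}


def bins_separate(pairs, min_gap=0.0):
    """True if the mean human grade rises by more than min_gap at every step up.

    With fewer than two distinct judge scores there is nothing to compare,
    so the check reports False.
    """
    means = list(mean_human_grade_by_judge_score(pairs).values())
    if len(means) < 2:
        return False
    return all(later - earlier > min_gap for earlier, later in zip(means, means[1:]))
```

A `False` here means the answers the judge scored higher were not the ones humans graded higher. Sparse bins make the means noisy, so a real check would likely also require a minimum number of items per bin.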

## Practices that hold up

The literature converges on a short list — the difference between a judge you can defend and a number you hope is right.

- **Prefer pairwise to pointwise.** Relative judgments are an easier task and dodge the uncalibrated-scale problem. Reserve pointwise for cases that truly need an absolute score, and anchor those with a concrete rubric.
- **Randomize answer position.** On every pairwise call, randomize which answer is A. Better, run both orders and count a result only when the verdict is order-stable; an unstable verdict is a tie, not a win. This converts position bias from a silent thumb on the scale into a visible, handled quantity.
- **Use a panel, not a monarch.** A single judge is a single point of bias. [_Replacing Judges with Juries_](https://arxiv.org/abs/2404.18796) (Verga et al., 2024) showed a panel of three smaller models from different families correlated better with human verdicts than a single GPT-4 judge, at roughly seven to eight times lower cost, since three small models together still cost less than one large one. A panel of disjoint families also dilutes self-preference: no single model's family dominates the vote. Cheaper and less biased is a rare combination; take it.
- **Pin the judge version.** A dated, specific model, never a floating alias. A version bump is a deliberate change, gated by a re-run of the meta-eval.
- **Keep a human-labeled anchor set.** The meta-eval set is the ground truth under the judge. Without it every other practice here is unfalsifiable — you are tuning a measurement you cannot check. The same way the [agent-observability essay](/blog/agent-observability/) argues a system is only as accountable as the evidence trail it keeps, an automated judge is only as trustworthy as the human-labeled set you can hold it against.
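
Two of these practices, running both orders and using a panel, compose naturally. The sketch below assumes a judge is any callable that takes `(question, first_answer, second_answer)` and returns `"first"`, `"second"`, or `"tie"`; that interface is hypothetical, and the plurality vote is one simple aggregation chosen for illustration, not a claim about how the cited paper pools its panel.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: (question, first_answer, second_answer)
# -> "first", "second", or "tie". Wrap your own model call to match it.
Judge = Callable[[str, str, str], str]


def order_stable_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie"; a verdict that changes with order is a tie."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"


def panel_verdict(judges: Sequence[Judge], question: str, answer_a: str, answer_b: str) -> str:
    """Plurality vote over order-stable verdicts; a split at the top is a tie."""
    if not judges:
        raise ValueError("panel needs at least one judge")
    votes = Counter(
        order_stable_verdict(judge, question, answer_a, answer_b) for judge in judges
    )
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]
```

A judge that always picks whichever answer it sees first can only ever contribute `"tie"` under this scheme, so its position bias shows up as lost votes instead of false wins. The share of items where `order_stable_verdict` collapses to a tie is also a usable signal for how order-sensitive each panel member is.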

A judge run this way is still imperfect, and that is fine. It is now an imperfect instrument with a known error bar, a pinned version, and a meta-eval that catches it when it drifts — a thing you can gate a deploy on. A judge run the other way — pointwise, single-model, floating version, no meta-eval — is a number with a decimal point and no warrant.

## The judge is a dependency, not the foundation

The mistake at the top of this post was treating the judge as bedrock — the fixed reference everything else is measured against. It is not bedrock. It is another model in the loop, with the same failure modes as the models it grades, and it belongs under the same evaluation discipline. The bedrock is the human-labeled set; the judge sits on top of it, measured against it, pinned, paneled, and probed.

The judge is the model in your eval harness; the [golden set](/blog/golden-set-maintenance/) is the data in it — and that data rots too, on its own schedule. A trustworthy eval needs both halves maintained. This post is the first half; treat the second with equal suspicion.

## Reading list

- Zheng et al., [_Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena_](https://arxiv.org/abs/2306.05685) — the foundational study; read past the 85%-agreement headline to the bias sections, because they are the same finding.
- Ye et al., [_Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge_](https://arxiv.org/abs/2410.02736) — a systematic catalogue of judge biases with a robustness number for each.
- Panickssery et al., [_LLM Evaluators Recognize and Favor Their Own Generations_](https://arxiv.org/abs/2404.13076) — the cleanest evidence that self-preference is real and tied to a judge recognizing its own text.
- Verga et al., [_Replacing Judges with Juries_](https://arxiv.org/abs/2404.18796) — the case for a panel of small diverse models over a single large judge, cheaper and better correlated with humans.

You would never ship a model you had not evaluated. The judge in your eval harness is a model. Evaluate it, or admit you do not know what your green dashboard means.