Two teams tell you their RAG system is “accurate.” Press them.
The first team means the answers are true. They graded a hundred outputs against what they know to be correct, and ninety were right. The second team means the answers are supported — every sentence the model produced traces back to a chunk the retriever returned. They also got ninety percent.
Both reported “90% accurate.” Neither measured the same thing. And both numbers, on their own, can hide a system that is about to embarrass someone in production.
“Accuracy” is not a RAG metric. It is a word that lets a team avoid picking one. This essay is the taxonomy that picks.
“Accuracy” hides the failure mode
A RAG answer is the product of two systems: a retriever that fetches context, and a generator that writes an answer from it. Either can fail independently. A single accuracy score collapses both into one number and tells you nothing about which one to fix.
You need at least two numbers, because there are two questions:
- Is the answer true? — does it match reality?
- Is the answer grounded? — does every claim trace to something the retriever actually returned?
Call the first faithfulness and the second groundedness. The labels are not standardized — RAGAS calls grounded-in-context “faithfulness”; other tools say “groundedness” or “attribution”; “correctness” and “answer accuracy” get swapped freely. Pick names, write them down, make the whole team use them. The names matter less than refusing to collapse the two ideas into one — because each can be high while the system is broken. That is the whole point.
Groundedness: every claim traces to a source
Groundedness asks a narrow question. Take the answer, break it into atomic claims, and for each claim check: is it supported by the retrieved context? Not by the model’s training data. Not by reality. By the specific chunks in the context window for this query.
A grounded answer is auditable. You can put a citation next to every sentence. For anything regulated — a statement about a policy, a contract, a filing — that is not a nice-to-have. It is the product.
Here is the failure mode groundedness misses: an answer can be perfectly grounded and completely wrong. If the retriever returns a stale document — last year’s pricing, a superseded contract clause, a memo that was later retracted — the model will faithfully ground its answer in that document. Every claim cites a source. Every citation checks out. The answer is still false, because the source was wrong.
Groundedness measures the generator’s honesty about its context. It says nothing about whether the context deserved that honesty.
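As a rough sketch of the claim-level check, assume the answer has already been split into atomic claims and each claim already carries a supported/unsupported verdict. The `ClaimVerdict` type and `groundedness_score` function are illustrative names, not part of any library, and the pricing claims are made-up data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimVerdict:
    claim: str
    supported: bool  # True if a retrieved chunk supports this claim


def groundedness_score(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of atomic claims supported by the retrieved context."""
    if not verdicts:
        # Assumption: no extracted claims counts as ungrounded,
        # not as vacuously grounded.
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)


# Hypothetical case: both claims are supported by a stale pricing document.
stale_but_grounded = [
    ClaimVerdict("The Pro plan costs $40 per month.", supported=True),
    ClaimVerdict("The Pro plan includes 10 seats.", supported=True),
]
print(groundedness_score(stale_but_grounded))  # 1.0
```

The score is a perfect 1.0 even though the source is out of date. Nothing in this computation looks at whether the context itself was right.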
Faithfulness: the answer is actually true
Faithfulness asks the question groundedness skips: set the retrieved context aside — is the answer correct? You grade it against a known-good answer, the way you would grade a human.
And here is the failure mode faithfulness misses: an answer can be true and completely ungrounded. Ask a RAG system “what is the capital of France” and it will say Paris whether or not the retriever returned anything useful. The model knew it. The answer is faithful. The retrieval pipeline contributed nothing — and the faithfulness score just reported that the system works.
It does not. You have an expensive vector database subsidizing a model that is answering from memory. The day a question arrives that the model cannot answer from memory — the one the whole RAG system exists for — retrieval is exposed, and the faithfulness score that looked fine all quarter never warned you.
Faithfulness measures the outcome. It does not tell you the retriever earned it.
The four quadrants
Put the two metrics on two axes and the system’s actual state falls out:
             GROUNDED               NOT GROUNDED
           ┌──────────────────────┬──────────────────────┐
FAITHFUL   │ Ship it.             │ Retriever is dead    │
(true)     │ True and auditable.  │ weight: the model    │
           │                      │ is answering from    │
           │                      │ memory.              │
           ├──────────────────────┼──────────────────────┤
NOT        │ Your sources are     │ Broken, and it will  │
FAITHFUL   │ wrong. Retrieval     │ fail loudly, so at   │
(false)    │ works; the corpus    │ least you will see   │
           │ is stale.            │ it.                  │
           └──────────────────────┴──────────────────────┘
The off-diagonal cells, top-right and bottom-left, are what a single accuracy number cannot see. Grounded-but-false and faithful-but-ungrounded both produce respectable-looking scores on the wrong metric, and both are shipping hazards. Grounded-but-false ships confident misinformation with citations attached. Faithful-but-ungrounded ships a system whose retrieval has never actually been tested.
You want the top-left. You can only confirm you are in the top-left by measuring both axes.
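One way to see where a system sits is to tally graded answers by cell. The sketch below assumes each answer has already been reduced to a pair of booleans (faithful, grounded). The `quadrant` function, its labels, and the sample data are illustrative only.

```python
from collections import Counter


def quadrant(faithful: bool, grounded: bool) -> str:
    """Map one graded answer to a cell of the two-by-two grid."""
    if faithful and grounded:
        return "true and auditable"
    if faithful:
        return "answering from memory"
    if grounded:
        return "sources are wrong"
    return "broken"


# Hypothetical graded answers as (faithful, grounded) pairs.
graded = [(True, True), (True, True), (True, False), (False, True), (False, False)]
counts = Counter(quadrant(f, g) for f, g in graded)
print(counts["true and auditable"])  # 2
```

A per-cell count like this makes the two quiet failure cells visible instead of folding them into a single rate.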
The metrics that sit underneath
Faithfulness and groundedness are outcome metrics — they grade the final answer. When one of them drops, you need the diagnostic metrics that say where:
- Context precision / recall: of the chunks retrieved, how many were relevant; of the relevant chunks that exist, how many were retrieved. Low recall caps how many answers can be both true and grounded, no matter how good the generator is.
- Hit rate@k (sometimes loosely called precision@k): did the gold document land in the top k. Strictly, precision@k is the fraction of the top k results that are relevant. This is the retriever’s own scorecard. (More on earning a good one in the chunking essay.)
- Refusal correctness: when the answer is genuinely not in the corpus, does the system say so, or improvise? A RAG system that never refuses is not faithful. It is lucky.
Outcome metrics tell you the system is sick. Diagnostic metrics tell you the organ. You need both layers, and you need them on the same dashboard.
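The diagnostic layer can be sketched with plain set arithmetic over document ids. This is a simplified, rank-unaware version written for illustration. Tools such as RAGAS define their own variants, and every function name and id below is an assumption.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Of the chunks retrieved, the fraction that were relevant."""
    if not retrieved:
        return 0.0
    return sum(doc in relevant for doc in retrieved) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Of the relevant chunks that exist, the fraction that were retrieved."""
    if not relevant:
        # Assumption: nothing relevant exists, so nothing was missed.
        return 1.0
    return len(set(retrieved) & relevant) / len(relevant)


def gold_in_top_k(retrieved: list[str], gold: set[str], k: int) -> bool:
    """Did any gold document land in the top k results?"""
    return any(doc in gold for doc in retrieved[:k])


def refusal_correct(answerable: bool, refused: bool) -> bool:
    """Correct means refusing exactly when the corpus has no answer."""
    return refused == (not answerable)


retrieved = ["a", "b", "c", "d"]       # hypothetical ranked chunk ids
relevant = {"a", "d", "e", "f"}        # hypothetical relevant chunk ids
print(context_precision(retrieved, relevant))  # 0.5
print(context_recall(retrieved, relevant))     # 0.5
print(gold_in_top_k(retrieved, {"d"}, k=3))    # False
```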
Building the harness
None of this is measurable without a golden set — the same graded question set the chunking essay argues you build before touching anything else. For each question, store the known-good answer and the document(s) the answer lives in. That single artifact feeds every metric above.
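A minimal sketch of what one golden-set record might hold, assuming a JSON array on disk. The field names (`question`, `golden_answer`, `source_doc_ids`) and the sample entry are invented for illustration, so adapt them to your own corpus.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenItem:
    question: str
    golden_answer: str
    source_doc_ids: tuple[str, ...]  # document(s) the answer lives in


def parse_golden_set(raw_json: str) -> list[GoldenItem]:
    """Parse a JSON array of question records into GoldenItem objects."""
    return [
        GoldenItem(
            question=row["question"],
            golden_answer=row["golden_answer"],
            source_doc_ids=tuple(row["source_doc_ids"]),
        )
        for row in json.loads(raw_json)
    ]


# Hypothetical file contents; real entries come from your own corpus.
EXAMPLE = """
[
  {"question": "What is the refund window?",
   "golden_answer": "30 days from delivery.",
   "source_doc_ids": ["policy-refunds-v3"]}
]
"""
items = parse_golden_set(EXAMPLE)
print(items[0].source_doc_ids)  # ('policy-refunds-v3',)
```

The golden answer feeds the truth check, and the source ids feed the retrieval metrics.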
Scoring runs as an LLM-as-judge pass:
- Groundedness — give the judge the answer and the retrieved context. Prompt it to decompose the answer into claims and mark each one supported or unsupported by the context.
- Faithfulness — give the judge the answer and the golden answer. Prompt it to decide whether they agree on the facts.
Two separate judge calls, two separate prompts, two separate scores. The moment you merge them into one “rate this answer 1–10” call, you are back to “accuracy” and you have learned nothing. Keep them apart.
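The two-call shape might look like the sketch below. The judge is passed in as a plain callable (prompt text in, reply text out), so no particular model API is assumed. The prompt wording, the `SUPPORTED`/`AGREE` reply format, and the `fake_judge` stand-in are all hypothetical.

```python
from typing import Callable

Judge = Callable[[str], str]  # hypothetical: prompt in, raw reply text out

GROUNDEDNESS_PROMPT = (
    "Break the ANSWER into atomic claims. For each claim, output one line: "
    "SUPPORTED if the CONTEXT supports it, otherwise UNSUPPORTED.\n\n"
    "CONTEXT:\n{context}\n\nANSWER:\n{answer}\n"
)
FAITHFULNESS_PROMPT = (
    "Do the ANSWER and the GOLDEN ANSWER agree on the facts? "
    "Reply with exactly AGREE or DISAGREE.\n\n"
    "GOLDEN ANSWER:\n{golden}\n\nANSWER:\n{answer}\n"
)


def judge_groundedness(judge: Judge, answer: str, context: str) -> float:
    """Call 1: fraction of claims the judge marks as supported by the context."""
    reply = judge(GROUNDEDNESS_PROMPT.format(context=context, answer=answer))
    verdicts = [line.strip().upper() for line in reply.splitlines() if line.strip()]
    if not verdicts:
        return 0.0  # assumption: an unparseable reply fails closed
    return sum(v == "SUPPORTED" for v in verdicts) / len(verdicts)


def judge_faithfulness(judge: Judge, answer: str, golden: str) -> bool:
    """Call 2: does the answer agree with the golden answer on the facts?"""
    reply = judge(FAITHFULNESS_PROMPT.format(golden=golden, answer=answer))
    return reply.strip().upper() == "AGREE"


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the plumbing."""
    return "SUPPORTED\nUNSUPPORTED" if "CONTEXT:" in prompt else "AGREE"


print(judge_groundedness(fake_judge, "some answer", "some context"))  # 0.5
print(judge_faithfulness(fake_judge, "some answer", "golden answer"))  # True
```

Each function sees only the inputs its question needs: the groundedness call never sees the golden answer, and the faithfulness call never sees the retrieved context.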
Then gate on it. The eval suite runs in CI; a regression in either score blocks the merge. The harness is one JSON file of questions in the repo and one pytest -k eval_ invocation — the same shape that grades a chunker, an embedder, or a prompt change. Build it once; every later decision gets cheaper.
The number that gates a deploy
You will be asked for one number. Resist giving it — give two, and a rule.
The rule we reach for: a deploy gates on groundedness ≥ threshold AND faithfulness ≥ threshold, never on an average. Averaging is exactly what lets a 0.95 on one metric paper over a 0.70 on the other. Two gates, both hard.
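The arithmetic is easy to check. The sketch below compares the two-gate rule with an averaged gate on the 0.95 and 0.70 scores mentioned above. The 0.80 thresholds and the function names are illustrative assumptions.

```python
def passes_two_gates(groundedness: float, faithfulness: float,
                     g_min: float, f_min: float) -> bool:
    """Both scores must clear their own threshold."""
    return groundedness >= g_min and faithfulness >= f_min


def passes_average_gate(groundedness: float, faithfulness: float,
                        avg_min: float) -> bool:
    """The mean of the two scores must clear a single threshold."""
    return (groundedness + faithfulness) / 2 >= avg_min


# 0.95 on one metric, 0.70 on the other; the mean is 0.825.
print(passes_average_gate(0.95, 0.70, avg_min=0.80))         # True
print(passes_two_gates(0.95, 0.70, g_min=0.80, f_min=0.80))  # False
```

The averaged gate lets the deploy through, while the two-gate rule blocks it on the weaker score.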
If you are forced to watch a single line on a chart, watch grounded faithfulness — the fraction of answers that are both true and fully traceable to the retrieved context. It is the top-left quadrant expressed as a percentage. It cannot be gamed by a model answering from memory, and it cannot be gamed by a model parroting a stale source. It is the only number that moves up only when the whole system — retriever and generator together — actually got better.
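A minimal sketch of that single line, assuming each answer has been graded as a (faithful, fully grounded) pair of booleans. The function name and the ten-answer sample are made up for illustration.

```python
def grounded_faithfulness(graded: list[tuple[bool, bool]]) -> float:
    """Fraction of answers that are both true and fully grounded."""
    if not graded:
        return 0.0
    both = sum(faithful and grounded for faithful, grounded in graded)
    return both / len(graded)


# Hypothetical run of ten answers: nine are true, nine are grounded,
# but only eight are both.
graded = [(True, True)] * 8 + [(True, False)] + [(False, True)]
print(grounded_faithfulness(graded))  # 0.8
```

Here each metric alone reports 90%, while the combined figure is 80%, because the two failures fall on different answers.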
“Is it accurate” was never the question. “Is it true, is it grounded, and did you measure them separately” is. Three questions. A team that cannot answer all three does not know whether its RAG system works. It just hasn’t been caught yet.
Reading list
- Braintrust, RAG evaluation metrics. A clear walk through the metric set and how to wire it into a harness.
- Patronus AI, the Lynx hallucination-detection model — useful if you want a purpose-built judge rather than a general one.
- The RAGAS documentation on faithfulness and context precision/recall. Read it, then notice that its “faithfulness” is this essay’s “groundedness.” That mismatch is the whole problem.