A RAG system is returning wrong answers on about one query in six. The team makes the obvious first move: retrieve more. They raise the retriever from the top five chunks to the top twenty, on the reasoning that if the passage the answer needs was not among the five, it is likelier to be among the twenty. And it is: instrument the candidate set and the right passage lands in it more often now. The answers get worse anyway, and slower. Handed twenty chunks, the model attends to a confidently worded but irrelevant one, treats it as evidence, and writes an answer around it; the larger context dilutes the signal that the smaller set would have kept in focus. The change moved the right number (recall went up) and the thing the user sees went down.
The reflex misdiagnosed the failure. A retriever’s job is recall: get the right passage into the candidate set at all. Once recall is adequate — once the passage is reliably somewhere in the top twenty — retrieving still more does nothing for recall and actively hurts what is now the real problem, which is precision at the very top of the list. The generator does not read the candidate set as a ranked list and reason about rank; it reads the chunks as context, and a wrong chunk near the top is a wrong chunk it will use. The thing that fixes precision at the top is not a bigger retriever. It is a second stage: a reranker.
This essay is about that second stage — what a reranker is, why it beats retrieving more, what it costs, and the part most write-ups skip: when it is not worth paying for. Reranking is close to the highest-return change available to a struggling RAG pipeline. It is also not free, and it is not unconditional.
Retrieval and reranking are not the same operation
A vector search is a bi-encoder. The query and every document are embedded separately — the documents long in advance, at index time — into single fixed vectors, and retrieval is a nearest-neighbour lookup among those vectors. That separation is what makes it fast: the expensive transformer passes over the documents already ran, offline, and what is left at query time is geometry — a similarity search that modern indexes answer in milliseconds over billions of vectors. The same separation is what makes it coarse. The query embedding and the document embedding were each computed without ever seeing the other. Whatever subtlety the query had — which clause is the constraint, which word is the rare one that actually matters — was compressed into a fixed vector before the document existed in the computation, and the document’s vector was frozen before the query existed. Their relevance is then approximated by one dot product between two independent summaries. A dot product is a thin channel through which to decide whether a passage answers a question.
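The separation described above can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a hypothetical hashed bag-of-words function standing in for a trained encoder, and `DIM`, `docs`, and `retrieve` are illustrative names, not any library's API. The point it shows is only the shape of the computation: documents are embedded before any query exists, and query-time work is one dot product per document.

```python
import math
import zlib

DIM = 256  # hypothetical toy dimensionality, chosen only for this sketch


def embed(text):
    """Toy stand-in for a trained encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # crc32 is used because it is deterministic across runs.
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Index time: each document is embedded without ever seeing a query.
docs = [
    "the reranker scores query and passage together",
    "vector search embeds documents at index time",
    "bananas are rich in potassium",
]
index = [embed(d) for d in docs]


def retrieve(query, top_n=2):
    """Query time: embed the query alone, then compare fixed vectors."""
    q = embed(query)
    order = sorted(range(len(docs)), key=lambda i: dot(q, index[i]), reverse=True)
    return order[:top_n]


print(retrieve("vector search at index time"))
```

Nothing in `retrieve` lets the query influence how a document was summarised; the only interaction between the two is the final dot product.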
A cross-encoder does the thing the bi-encoder skipped. It takes the query and one candidate document together, concatenated into a single input, and runs the full transformer over the pair, so every query token attends to every document token and back. The output is a single relevance score for that exact pair, computed with both halves present. As Pinecone’s account of two-stage retrieval puts it, the cross-encoder runs “the raw information directly into the large transformer computation” rather than comparing two pre-made embeddings. That is structurally a richer judgement: the model can notice that the passage mentions the query’s entity but in the wrong relation, or that a single qualifying word flips the passage from relevant to not. It is far more accurate about relevance — and it cannot be precomputed, because the computation needs the query, which is exactly why it can only ever be a second stage.
STAGE 1  bi-encoder retrieval           STAGE 2  cross-encoder rerank

query ───►[embed]──┐                    ┌─ (query, doc) ─►[full transformer]─► 0.91
                   ├─ nearest           ├─ (query, doc) ─►[full transformer]─► 0.12
docs  ───►[embed]──┘  neighbour         ├─ (query, doc) ─►[full transformer]─► 0.74
(embedded at index time)                └─ ...

fast · coarse · scales to billions      accurate · costly · scales with N
job: recall (get it in the set)         job: precision (order the top)
The division of labour is the whole design, and the two stages are tuned against different metrics. Stage one runs over the entire corpus and only has to be good enough to get the right passage into a candidate set of a few dozen — it is judged on recall, and a recall of “almost always in the top fifty” is a passing grade. Stage two runs only over that small candidate set and has to be good enough to put the right passage first — it is judged on precision at the top, and there “almost” is a failure. Neither stage can do the other’s job: a bi-encoder cannot rank the head of the list well, and a cross-encoder cannot be run over a billion documents. They are not redundant; they are complementary.
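The two-stage shape can be sketched as a single function with two pluggable scorers. Everything here is illustrative: `cheap_score` and `pair_score` are hypothetical toy stand-ins (word overlap, and word overlap with a penalty for a stated opposite), not a real bi-encoder or cross-encoder, and `OPPOSITES` is a made-up two-word vocabulary. The sketch only shows where each stage runs and what it is allowed to see.

```python
# Hypothetical toy vocabulary of opposites, used only by the toy pair scorer.
OPPOSITES = {"enable": "disable", "disable": "enable"}


def cheap_score(query, doc):
    """Stage-1 stand-in: counts shared words, blind to how they relate."""
    return len(set(query.split()) & set(doc.split()))


def pair_score(query, doc):
    """Stage-2 stand-in: reads both texts, so it can penalise a passage
    that contains the opposite of a word the query asked about."""
    doc_words = set(doc.split())
    contradictions = sum(1 for w in query.split() if OPPOSITES.get(w) in doc_words)
    return cheap_score(query, doc) - 5 * contradictions


def retrieve_then_rerank(query, corpus, n_candidates=50, k_final=5):
    # Stage 1: cheap score over the whole corpus, keep a bounded candidate set.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: joint score, run only over the candidates.
    reranked = sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)
    return reranked[:k_final]


query = "how to enable caching in the gateway"
corpus = [
    "how to disable caching in the gateway",
    "enable caching by setting cache true in gateway config",
    "release notes for the billing service",
]

first_stage_top = max(corpus, key=lambda d: cheap_score(query, d))
final_top = retrieve_then_rerank(query, corpus, n_candidates=3, k_final=1)
print(first_stage_top)
print(final_top[0])
```

In this toy, the first stage prefers the passage about disabling caching because it shares the most words with the query; the second stage, which sees the pair together, promotes the passage that actually matches the request.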
Why “just retrieve more” backfires
Raising the candidate count helps recall and nothing else. If the passage was already reliably in the top twenty, going to fifty adds no recall, because the passage was already there. It does make precision-at-top worse: the generator is now choosing among fifty things instead of twenty, and most first-stage retrieval scores are too coarse to rank the head of that longer list well. You have enlarged the haystack and kept the same needle.
It also damages the generator directly, and this is the part the recall framing hides. Pinecone’s account states the effect plainly: “LLM recall degrades as we put more tokens in the context window.” Stuffing twenty marginally relevant chunks into the prompt does not give the model twenty chances to be right; it gives it nineteen distractors and one signal, and the model spends attention on the distractors — it cannot tell, from inside the prompt, which chunk was rank one and which was rank twenty. Worse, a distractor is not neutral. A chunk that is topical but wrong reads to the model like supporting evidence, and the model will quote it. “Retrieve more” treats the symptom — the right passage not being read — by making the disease worse: it adds exactly the kind of plausible-but-wrong context that produces a confident wrong answer. The right passage needs to be at the top, where the generator weights it most and the distractors are gone, and moving it there is a different operation from enlarging the set. This is the same lesson the chunking essay reaches from the indexing side: RAG quality is a precision problem long before it is a volume problem.
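The gap between "in the set" and "at the top" is easy to see once both are computed on the same ranked lists. The sketch below uses made-up candidate lists and passage ids (`runs` is hypothetical data, not a measurement); it shows that widening the cut-off raises recall while the top-of-list metrics stay exactly where they were, because nothing was reordered.

```python
def recall_at_k(ranked, gold, k):
    """1.0 if the right passage is anywhere in the first k candidates."""
    return 1.0 if gold in ranked[:k] else 0.0


def reciprocal_rank(ranked, gold):
    """1/rank of the right passage, or 0.0 if it was not retrieved."""
    return 1.0 / (ranked.index(gold) + 1) if gold in ranked else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical first-stage outputs: (ranked candidate ids, id of the right passage).
runs = [
    (["d7", "d2", "d9", "d1", "d4", "d3"], "d1"),  # right passage at rank 4
    (["d5", "d8", "d6", "d0", "d2", "d3"], "d3"),  # right passage at rank 6
    (["d1", "d4", "d0", "d9", "d8", "d7"], "d1"),  # right passage at rank 1
]

recall_at_3 = mean(recall_at_k(r, g, 3) for r, g in runs)
recall_at_6 = mean(recall_at_k(r, g, 6) for r, g in runs)
top_1 = mean(recall_at_k(r, g, 1) for r, g in runs)
mrr = mean(reciprocal_rank(r, g) for r, g in runs)

print(f"recall@3={recall_at_3:.2f} recall@6={recall_at_6:.2f}")
print(f"top-1={top_1:.2f} MRR={mrr:.2f}  (unchanged by retrieving more)")
```

Going from three candidates to six lifts recall from one query in three to all of them, yet the right passage is still first for only one query in three.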
What a reranker actually buys
The gain is measurable and large. “In Defense of Cross-Encoders for Zero-Shot Retrieval” (arXiv 2212.06121) found a cross-encoder reranker beating a comparable bi-encoder by more than four points, on average, across the eighteen datasets of the BEIR benchmark — and, importantly, the gains were “much larger in new domains.” That qualifier is the useful part. A bi-encoder is strong on corpora that resemble its training data and weak on corpora that do not, because its embeddings encode the relevance patterns it was trained on; the reranker, reading query and passage together at query time, recovers relevance the frozen embeddings could not. So the reranker helps most exactly where dense retrieval is weakest — on the specialised, out-of-domain corpus that most production RAG systems actually run on, rather than the clean benchmark corpus a retriever was tuned for.
The lineage of rerankers worth knowing is short, and each step widened where the technique works. monoT5 (arXiv 2003.06713) established the sequence-to-sequence reranker — frame relevance as a generation task — and showed it transferring zero-shot to out-of-domain ranking tasks it was never trained on. RankT5 (arXiv 2210.10634) trained T5 rerankers with ranking-specific losses rather than a generic objective, and reported notably “better zero-shot ranking performance on out-of-domain data sets.” Listwise LLM rerankers — RankGPT (arXiv 2304.09542) — change the scoring unit: instead of scoring each candidate alone, they score the candidates against each other, ordering the whole list in one pass, and can be competitive with supervised methods. The common thread across all three is the one mechanism: a reranker reads query and passage together, and that joint read is worth several points of ranking quality the bi-encoder structurally cannot recover, no matter how large the embedding model.
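As a rough illustration of what "frame relevance as a generation task" means for a pointwise reranker in the monoT5 style: the model is shown the query and one document, and the relevance score is read from how strongly it prefers the target word "true" over "false". The sketch below calls no model; the logits in `candidates` are invented numbers, and `relevance_from_logits` is a hypothetical helper name.

```python
import math


def relevance_from_logits(true_logit, false_logit):
    """Softmax restricted to the two target tokens; returns P('true')."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


# Invented (true_logit, false_logit) pairs, one per candidate document.
candidates = {
    "doc_a": (2.0, -1.0),
    "doc_b": (0.1, 0.3),
    "doc_c": (-1.5, 2.5),
}

# Pointwise reranking: score every candidate alone, then sort by score.
ranking = sorted(
    candidates,
    key=lambda doc_id: relevance_from_logits(*candidates[doc_id]),
    reverse=True,
)
print(ranking)
```

A listwise reranker differs in the unit of work: it receives the whole candidate list and returns an ordering, rather than one score per pair.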
The latency you pay
The cost is the mirror image of the benefit. Because a cross-encoder cannot precompute anything, it must run the full transformer once per candidate at query time, and its cost therefore scales linearly with the number of candidates — double the candidates, double the reranking time. Pinecone’s worked figure makes the scale concrete: reranking 40 million documents with a BERT-class cross-encoder on a V100 would take more than fifty hours, against under 100 milliseconds for the bi-encoder’s nearest-neighbour search over the same set. Fifty hours versus a tenth of a second is not a tuning difference; it is the reason the architecture has two stages at all.
That ratio is why a reranker is only ever a second stage. You never rerank the corpus; you retrieve a small candidate set cheaply — twenty-five, fifty, a hundred — and rerank only that. The reranking cost is then bounded by the candidate count, not the corpus size: a hundred cross-encoder passes is a fixed bill that does not grow as the corpus grows from a million documents to a billion. In that bounded regime the cost lands in the range of tens to low hundreds of milliseconds. That is real latency, and it sits on the critical path of every single query — not amortized, not offline, paid every time. It is the number the decision below turns on, and a pipeline with a tight latency budget has to look at it honestly rather than assume the “cheap win” is free.
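A back-of-envelope version of that bounded bill, derived only from the figure quoted above. Dividing fifty hours by 40 million documents gives an implied per-document cost; the constant names and the `rerank_ms` helper are illustrative, and the result is a rough sequential estimate under that one quoted setup, not a measurement of any particular reranker or hardware.

```python
CORPUS_DOCS = 40_000_000   # corpus size in the quoted example
FULL_RERANK_HOURS = 50     # the quoted figure is "more than" this, so treat it as a floor

# Implied cost of one cross-encoder pass, in milliseconds.
per_doc_ms = FULL_RERANK_HOURS * 3600 * 1000 / CORPUS_DOCS


def rerank_ms(n_candidates, cost_per_doc_ms=per_doc_ms):
    """Linear cost model: one full pass per candidate, no batching assumed."""
    return n_candidates * cost_per_doc_ms


print(f"implied cost per document: {per_doc_ms} ms")
for n in (25, 50, 100):
    print(f"{n:>3} candidates -> {rerank_ms(n):.1f} ms")
```

The estimate depends only on the candidate count, which is the point: the same hundred passes cost the same whether the corpus holds a million documents or a billion.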
When it does not earn its latency
This is the half the “cheapest win” framing tends to drop. A reranker is high-return when conditions hold, and the conditions do not always hold.
- The retriever may already be precise enough. If the right passage is usually already at rank one or two, a reranker reorders a list that was fine and bills you the latency for it — the second stage runs, finds the same passage already on top, and changes nothing except the response time. Measure top-k precision first; if it is high, the reranker has nothing to fix and you are buying latency for no gain.
- The gains on your queries may be smaller than the benchmark gains. The 2025 study “How Good are LLM-based Rerankers?” (ACL Findings 2025) evaluated twenty-two reranking methods and found that generalization to genuinely novel queries “varies,” that lightweight models often match heavyweight LLM rerankers on efficiency-comparable terms, and — the sharp caveat — that some headline LLM-reranker advantages “may partly reflect data leakage rather than superior reasoning.” A benchmark number measured on a public dataset can be inflated by that dataset having leaked into training. Benchmark lift is a hypothesis about your traffic, not a promise.
- A big reranker is rarely necessary. RankGPT showed a distilled 440-million-parameter reranker outperforming a 3-billion-parameter supervised model: roughly seven times smaller and still ahead. If you do add a reranker, the latency-cheap small one is usually the right call; reach for an LLM-scale reranker only if an eval on your own data says the small one left points on the table, because the large one’s latency is several times the small one’s for a gain that may be zero.
- The end-to-end gain is contingent. The BERGEN RAG-benchmarking work (arXiv 2407.01102) is a reminder that a reranker’s contribution to final answer quality depends on the retriever and generator around it — a better-ordered candidate list only helps if the generator was actually losing answers to bad ordering. A component win is not automatically a system win, and only an end-to-end measurement tells you which you got.
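The caveats above all reduce to measuring before and after on your own queries. A minimal harness for that, under assumptions: `cases` is hypothetical labelled data (query, first-stage candidates in order, the passage known to answer it), `overlap_rerank` is a toy word-overlap reranker standing in for a real one, and the hit rate at k is used as a simple proxy for precision at the top.

```python
import time


def evaluate(cases, rerank, k=1):
    """Compare how often the right passage is in the top k before and
    after reranking, and record the average reranking time per query."""
    hits_before = hits_after = 0
    elapsed = 0.0
    for query, candidates, gold in cases:
        hits_before += gold in candidates[:k]
        start = time.perf_counter()
        reranked = rerank(query, candidates)
        elapsed += time.perf_counter() - start
        hits_after += gold in reranked[:k]
    n = len(cases)
    return {
        "hit_before": hits_before / n,
        "hit_after": hits_after / n,
        "avg_rerank_ms": 1000 * elapsed / n,
    }


def overlap_rerank(query, candidates):
    """Toy reranker: orders candidates by words shared with the query."""
    q = set(query.split())
    return sorted(candidates, key=lambda d: len(q & set(d.split())), reverse=True)


cases = [
    ("reset admin password",
     ["billing faq", "reset user avatar", "change plan", "reset admin password steps"],
     "reset admin password steps"),
    ("export audit log",
     ["export audit log guide", "import data", "audit policy"],
     "export audit log guide"),
]

with_reranker = evaluate(cases, overlap_rerank)
without_change = evaluate(cases, lambda query, candidates: list(candidates))
print(with_reranker)
print(without_change)
```

If `hit_before` is already high, the reranker has little to fix; if `hit_after` barely moves, the latency in `avg_rerank_ms` bought nothing. This is still a component-level check, so the same comparison would need repeating on final answers to confirm a system-level gain.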
The decision
Add a reranker when all three conditions hold. First, recall is adequate but ordering is not: the right passage is usually in the top-N candidates but not in the top-k the generator actually sees, so there is genuinely a misordering for the second stage to fix. Second, the candidate set is bounded to tens or low hundreds, so the reranking cost is a fixed bill rather than a corpus-scaled one. Third, the latency budget can absorb tens to low hundreds of milliseconds on every query. In that regime, which is most struggling RAG pipelines, it is genuinely the cheapest large win available, because it touches one stage and leaves indexing and generation untouched: no re-embedding the corpus, no prompt surgery, one component added between two that stay as they were.
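The three conditions can be written down as an explicit gate. This is a sketch, not a rule: `PipelineStats`, `should_add_reranker`, and the thresholds `min_gap` and `max_candidates` are all hypothetical, and the default values are placeholders to be replaced by numbers from your own measurements.

```python
from dataclasses import dataclass


@dataclass
class PipelineStats:
    recall_at_n: float          # share of queries with the right passage in the top-N candidates
    hit_at_k: float             # share with the right passage in the top-k the generator sees
    n_candidates: int           # size of the candidate set handed to the reranker
    est_rerank_ms: float        # estimated reranking time per query
    latency_headroom_ms: float  # latency the budget can still absorb per query


def should_add_reranker(stats, min_gap=0.10, max_candidates=200):
    """True only when all three conditions hold (thresholds are placeholders)."""
    # 1. The passage is retrieved but misordered: recall@N clearly exceeds hit@k.
    misordered = (stats.recall_at_n - stats.hit_at_k) >= min_gap
    # 2. The reranking bill is bounded by a small candidate set.
    bounded = stats.n_candidates <= max_candidates
    # 3. The latency budget can absorb the extra stage on every query.
    affordable = stats.est_rerank_ms <= stats.latency_headroom_ms
    return misordered and bounded and affordable


good_fit = PipelineStats(0.95, 0.70, 50, 120.0, 300.0)
already_precise = PipelineStats(0.72, 0.70, 50, 120.0, 300.0)
too_slow = PipelineStats(0.95, 0.70, 50, 120.0, 80.0)

print(should_add_reranker(good_fit))
print(should_add_reranker(already_precise))
print(should_add_reranker(too_slow))
```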
Do not add one, or do not add one yet, when the retriever’s top-k precision is already high, when the latency budget is already tight, or when you have not yet measured which of recall and precision is actually failing — adding a reranker to a recall problem fixes nothing, because the right passage was never in the candidate set for the second stage to promote. And whichever way the decision goes, gate it on a held-out eval, the way the eval-driven development essay argues — a reranker is a change to retrieval quality, and a change to retrieval quality you did not measure is a change you do not understand, however reasonable the mechanism sounds. The retrieve-then-rerank split also composes with the lexical-versus-dense question the hybrid-search essay takes up: when you run both a lexical and a dense retriever, rerank the single fused candidate list, not either retriever’s output on its own.
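One way to picture "rerank the single fused candidate list": merge the two retrievers' rankings first, then apply the second stage once. The text does not name a fusion method; the sketch below uses reciprocal rank fusion as one common choice, which is an assumption on my part. The document ids and the `pair_scores` table are invented, with the table standing in for a real reranker's scores.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


lexical = ["d3", "d1", "d7"]  # hypothetical lexical retriever output
dense = ["d1", "d9", "d3"]    # hypothetical dense retriever output

# Step 1: one fused candidate list.
fused = reciprocal_rank_fusion([lexical, dense])

# Step 2: rerank that single list (invented scores in place of a real reranker).
pair_scores = {"d9": 0.9, "d1": 0.6, "d3": 0.2, "d7": 0.1}
final = sorted(fused, key=pair_scores.get, reverse=True)

print(fused)
print(final)
```

Reranking after fusion means the second stage runs once per query and can promote a passage that only one of the two retrievers found.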
The checklist
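Before adding a reranker to a RAG pipeline, confirm:
- You have measured which of recall and precision is actually failing.
- The right passage is usually in the top-N candidate set but not in the top-k the generator sees.
- The retriever’s top-k precision is not already high.
- The candidate set is bounded to tens or low hundreds.
- The latency budget can absorb tens to low hundreds of milliseconds on every query.
- A small reranker has been tried before an LLM-scale one.
- The change is gated on a held-out, end-to-end eval.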
Reading list
- In Defense of Cross-Encoders for Zero-Shot Retrieval (arXiv 2212.06121): the >4-point BEIR gain and why it grows out-of-domain
- Pinecone, rerankers and two-stage retrieval (pinecone.io): the cross-encoder mechanism and the 50-hour latency figure
- monoT5 (arXiv 2003.06713): the sequence-to-sequence reranker and its zero-shot transfer
- RankT5 (arXiv 2210.10634): ranking-loss fine-tuning and stronger out-of-domain ranking
- RankGPT (arXiv 2304.09542): listwise LLM reranking, and a distilled 440M model outperforming a 3B one
- How Good are LLM-based Rerankers? (ACL Findings 2025): the honest counter-evidence on varying generalization, leakage, and lightweight parity
- BERGEN (arXiv 2407.01102): a RAG benchmarking library; reranking’s end-to-end contribution is contingent
Retrieving more gives the model more to ignore. Reranking gives it the right passage first. One of those is the fix.