# Hybrid search: when BM25 still beats your embeddings

`ERR-4021` and `ERR-4201`. To a person, two unrelated error codes — one is a failed credential refresh, the other a malformed request body, and a support engineer would never confuse them. To an embedding model, two points a hair apart in vector space. The strings differ by one transposition, the model never saw either token in training, and so it embeds each one by surface shape — by the rough silhouette of the characters — and surface shape is nearly identical. Ask the dense index for a passage about `ERR-4021` and it will happily hand back the one about `ERR-4201`, with a high similarity score, because by the only measure it has the two are almost the same thing.

This is the failure that arrives a few weeks after a team retires keyword search for an embedding model. On the demo corpus — clean prose, general-knowledge questions — the embeddings looked like a clear upgrade: they matched a question to a passage that shared not one word with it, the move keyword search structurally cannot make. Nothing about that demo was wrong. The embedding model is doing exactly what it was built to do: map text to a region of space by meaning. The problem surfaces only on a corpus where meaning is partly carried by exact tokens — error codes, part numbers like `MX-7-A` against `MX-7-B`, statute numbers, SKUs, units — tokens the model never learned and therefore embeds by shape rather than by sense. On that corpus the old keyword index was not a legacy embarrassment waiting to be deleted. It was doing a specific job, and the embeddings cannot do that job.

This essay is about that job, why it does not go away, and why production retrieval almost always wants both a lexical and a dense index rather than a winner between them.

## The two retrievers fail in opposite places

Lexical retrieval, with BM25 as the standard, scores a document by the query terms it literally contains. Each term is weighted by how often it appears in that document (more occurrences make it more central, with diminishing returns and a correction for document length) and by how rare it is across the corpus (a term that appears in few documents is more discriminating than one that appears in all of them). It matches tokens, nothing more. It runs on an inverted index, a map from each term to the documents containing it; it needs no training and no model; and it has no opinion about meaning whatsoever. Provided the tokenizer keeps the code intact, `ERR-4021` matches the token `ERR-4021` and matches nothing else, which is precisely the property the opening needed. The same literalness means it cannot match "login is broken" to a passage about authentication, because those share no tokens, and BM25 has no idea they are related.
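As a rough illustration of that scoring, here is a minimal BM25 sketch over an inverted index. It is a toy under stated assumptions, not any library's implementation: the `BM25Index` class, the whitespace `tokenize` function, the three sample documents, and the parameter values `k1=1.2` and `b=0.75` (common defaults) are all introduced here for illustration. The tokenizer deliberately keeps `ERR-4021` as a single token; many production analyzers split on hyphens, which would change the behaviour.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split, so "ERR-4021" stays one token.
    return text.lower().split()


class BM25Index:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_len = {}
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        for doc_id, text in docs.items():
            tokens = tokenize(text)
            self.doc_len[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.n_docs = len(docs)
        self.avg_len = sum(self.doc_len.values()) / self.n_docs

    def idf(self, term):
        # A commonly used smoothed IDF variant: rarer terms get larger weights.
        df = len(self.postings.get(term, {}))
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))

    def search(self, query):
        scores = defaultdict(float)
        for term in tokenize(query):
            for doc_id, tf in self.postings.get(term, {}).items():
                # Length normalisation: long documents need more hits to score the same.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                # Term frequency saturates instead of growing linearly.
                scores[doc_id] += self.idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical three-document corpus.
docs = {
    "auth": "ERR-4021 credential refresh failed",
    "body": "ERR-4201 malformed request body",
    "recover": "authentication failures and how to recover",
}
index = BM25Index(docs)
code_hits = index.search("ERR-4021")
login_hits = index.search("login is broken")
```

With this corpus, `code_hits` contains only the `auth` passage, and `login_hits` is empty because no document contains any of the query's tokens. Both halves of the paragraph show up in one index: exactness on the error code, blindness to the paraphrase.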

Dense retrieval embeds the query and every document into vectors with a trained model and retrieves by vector nearness. It matches meaning. It can connect "my login is broken" to a passage about authentication failures that shares no words with the query, because in the embedding space the model learned, those phrases sit close together. But it can do that only for the language patterns the embedding model actually saw in training — its sense of meaning is exactly the sense it was trained into, and a token outside that training, like a product-specific error code, has no learned meaning to place it by. So it falls back on shape.
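The retrieval half of that description, ranking by vector nearness, can be sketched in a few lines. The vectors below are hand-written three-dimensional stand-ins, not the output of any real embedding model, and the names `cosine`, `dense_search`, `doc_vecs`, and `query_vec` are hypothetical. A real system would get the vectors from a trained encoder and would usually search them with an approximate nearest-neighbour index rather than a full scan.

```python
import math


def cosine(u, v):
    # Cosine similarity: dot product of the two vectors divided by their lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dense_search(query_vec, doc_vecs, top_k=3):
    # Brute-force scan: score every document, keep the nearest top_k.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]


# Invented vectors standing in for model output.
doc_vecs = {
    "auth_failures": [0.9, 0.1, 0.0],
    "billing_faq": [0.1, 0.9, 0.1],
    "release_notes": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "my login is broken"

nearest = dense_search(query_vec, doc_vecs)
```

Nothing in `dense_search` looks at words. Whether `auth_failures` comes out on top depends entirely on where the encoder placed the query, which is why a token the encoder never learned can land in the wrong neighbourhood.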

That opposition is the whole basis of the essay. One retriever matches tokens and is blind to meaning; the other matches meaning and is blind to unfamiliar tokens. They do not fail in the same places, and that is what makes running both worth its cost.

## Where dense retrieval wins

In-domain, dense retrieval is genuinely better, and the original Dense Passage Retrieval work ([arXiv 2004.04906](https://arxiv.org/abs/2004.04906)) put a number on it: trained on in-domain question–passage pairs, DPR beat a strong BM25 system by 9 to 19 points of top-20 retrieval accuracy. That is not a marginal edge — it is the gap between a retriever you trust and one you do not. When the corpus and the queries resemble what the embedding model was trained on, semantic matching wins, and it wins decisively, because it does the thing BM25 cannot do at all: it retrieves on synonymy and paraphrase, finding the passage that means the question without quoting any of its words. A user who asks "how do I reset my password" and a document headed "credential recovery procedure" share almost no vocabulary; the dense retriever connects them, the lexical one does not.

Hold onto the condition, because the whole decision turns on it: _in-domain_. The 9-to-19-point DPR gain is a gain measured on data that looks like the model's training data — questions and corpora of the kind the retriever was trained against. That is the case the demo corpus quietly satisfied, with its clean prose and general-knowledge questions, and the case the production corpus, full of proprietary identifiers and domain jargon, did not. The benchmark number is real; it is just a number about a specific situation, and reading it as an unconditional verdict on dense retrieval is the mistake that retires the keyword index.

## Where BM25 still wins

Out of domain, the picture inverts, and this is the central finding of the BEIR benchmark. BEIR ([arXiv 2104.08663](https://arxiv.org/abs/2104.08663)) evaluated retrieval systems across eighteen datasets spanning many domains and reported that "BM25 is a robust baseline" while dense models, computationally efficient as they are, "often underperform" once the corpus stops resembling their training data. The cross-encoder study ([arXiv 2212.06121](https://arxiv.org/abs/2212.06121)) sharpened the point for the architecture most production systems actually use: bi-encoder dense retrievers, used as the first stage, gave "no gains in comparison to a simpler retriever such as BM25 on out-of-domain tasks." No gains: the embedding model, the GPU index, the whole apparatus, drawing even with a token-counting algorithm from the 1990s, on the kind of corpus a real deployment is most likely to have.

It is not even that every dense retriever loses out-of-domain — a strong unsupervised one can do well, and the picture is genuinely uneven rather than uniformly bad. Contriever ([arXiv 2112.09118](https://arxiv.org/abs/2112.09118)), trained without supervised relevance labels, beat BM25 on 11 of 15 BEIR datasets at Recall@100 — a clear majority — and still _lost_ to BM25 on specific ones: TREC-COVID and Touché-2020, corpora dense in specialised terminology. That split is the result in miniature. Dense retrieval's out-of-domain performance is not a flat line; it is high on some corpora and low on others, and the corpora where it falls behind are the ones full of exact, specialised terms — the error codes and part numbers from the opening, the technical vocabulary BM25 was always good at. BM25 wins there for a precise reason: matching a rare exact token is the entire thing it does, and matching a rare exact token is exactly what an embedding model trained on other text cannot reliably do, because that token was never in its training and so was never given a meaning.

## Learned sparse retrieval narrows the gap

The lexical/dense split is not a permanent border, and it is worth knowing the technique that sits across it. Learned sparse retrieval — SPLADE ([arXiv 2107.05720](https://arxiv.org/abs/2107.05720)) and SPLADE v2 ([arXiv 2109.10086](https://arxiv.org/abs/2109.10086)) — keeps the thing BM25 is good at, exact term matching on an inverted index, and adds something BM25 lacks: learned term expansion. A SPLADE model, at index time, expands a document into a weighted set of terms that includes words the document does not literally contain but is about — so a passage about "authentication" can be made retrievable by the token "login" as well, while still being a sparse, invertible representation. SPLADE v2 reported over 9% NDCG@10 gains on TREC DL 2019 alongside strong out-of-domain BEIR numbers. It sits in the middle of the spectrum — lexical exactness on one side, a learned semantic reach on the other — and it is genuine evidence that "lexical" and "dense" are two ends of a range, not two sealed boxes. But it does not dissolve the decision. A SPLADE index is still a model-dependent index with its own training distribution, and a team still has to reason about which signal — exact token or learned meaning — its corpus actually leans on.
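To make the mechanics concrete, here is a toy sketch of retrieval over expanded sparse representations. It is not SPLADE: in SPLADE the term weights come from a trained model, whereas the weights below are hand-written for illustration, and the names `expanded_docs`, `build_inverted_index`, and `sparse_search` are hypothetical. The assumption is that the `auth_guide` passage never literally contains "login" or "password"; those entries play the role of learned expansion terms.

```python
from collections import defaultdict

# Hand-written stand-in for a learned sparse encoder's output: term -> weight.
expanded_docs = {
    "auth_guide": {"authentication": 2.1, "failure": 1.4, "login": 0.9, "password": 0.6},
    "billing_faq": {"invoice": 2.0, "refund": 1.5, "payment": 1.1},
}


def build_inverted_index(docs):
    # Same structure BM25 uses: each term maps to the documents that carry it.
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            index[term].append((doc_id, weight))
    return index


def sparse_search(query_weights, index):
    # Score = dot product of the query and document weights over shared terms.
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


sparse_index = build_inverted_index(expanded_docs)
sparse_hits = sparse_search({"login": 1.0, "broken": 0.8}, sparse_index)
```

The query term "login" reaches `auth_guide` through an exact match on an expansion term, so the lookup stays a plain inverted-index operation even though the match is semantic in origin.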

## Hybrid is the production default

Because the failure modes are complementary, with one retriever strong exactly where the other is weak, the production answer is usually not to pick a winner. It is to run both indexes and fuse the two ranked lists into one. Anthropic's contextual-retrieval work gives the cleanest measured case: contextual embeddings alone cut the top-20 retrieval-failure rate by 35%; adding a contextual BM25 index alongside them cut it by 49%. Read the second number against the first. The dense retriever was already a strong, modern, carefully engineered embedding system, and bolting a decades-old lexical index next to it removed roughly another fifth of the failures that remained. The lexical index was not redundant beside a good embedding model; it was catching a class of failure the embedding model, however good, was always going to miss.
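Both reductions are quoted against the same baseline, so the contribution of BM25 on top of the embeddings has to be derived rather than read off. A quick check of that arithmetic, using only the two percentages quoted above (the function and variable names are illustrative):

```python
def remaining_after(reduction):
    # Fraction of the baseline failures still left after a relative reduction.
    return 1.0 - reduction


dense_only = remaining_after(0.35)  # embeddings alone: 65% of baseline failures remain
hybrid = remaining_after(0.49)      # embeddings + BM25: 51% remain

# Share of the failures left by the dense index that the added BM25 index removes.
extra_from_bm25 = (dense_only - hybrid) / dense_only
print(f"{extra_from_bm25:.1%} of the remaining failures")
```

The result is a little over one fifth: the lexical index removes about 22% of the failures the dense index was still making.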

The standard way to combine two ranked lists is Reciprocal Rank Fusion ([Cormack et al., SIGIR 2009](https://cormack.uwaterloo.ca/cormacksigir09-rrf.pdf)). Each document's fused score is the sum, over both retrievers, of 1/(k + rank), where rank is the document's 1-based position in that retriever's list and k is a smoothing constant the paper fixes at 60. In the usual implementation, a document missing from one list simply contributes nothing for that list. The design choice that matters is what RRF deliberately ignores: it never looks at the retrievers' raw scores, only at the ranks. That is the point. A BM25 score and a cosine similarity are numbers on incompatible scales (there is no honest way to add a BM25 score of 14.2 to a similarity of 0.71), and any fusion that tried to combine the scores would need a calibration step that is fragile and corpus-specific. Ranks are comparable across any two retrievers by construction: rank one is rank one whatever produced it. By summing 1/(k + rank), RRF rewards a document that both retrievers rank highly, lets a document either retriever places near the top still score well, and works with k left at its default, so there is nothing to tune per corpus. That is why it has become the default fusion step.

```
   query
     │
     ├──────────────────────┐
     ▼                      ▼
  BM25 (lexical)        dense (vector)
  exact-token match     meaning match
     │                      │
     ▼                      ▼
  ranked list A         ranked list B
   1. doc P              1. doc Q
   2. doc Q              2. doc R
   3. doc R              3. doc P
     │                      │
     └──────────┬───────────┘
                ▼
   RRF fuse:  score(d) = Σ 1/(k + rank_d)
   ranks only: no score calibration needed
                │
                ▼
   fused list  1. doc Q   2. doc P   3. doc R
                │
                ▼
   rerank the fused list  →  generator
```

Retrieve with BM25, retrieve with dense, fuse with RRF, and then — as [the reranking essay](/blog/reranking-rag/) argues — rerank the single fused list rather than either retriever's raw output, so the second-stage cross-encoder is ordering the best candidates from both retrievers at once.
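The fusion step itself is small enough to sketch in full. This is a minimal version assuming each retriever returns an ordered list of document IDs, best first; the function name `rrf_fuse` and the variable names are illustrative, and `k=60` follows the value used in the RRF paper. The input lists are lists A and B from the diagram.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    # Sum 1 / (k + rank) over every list a document appears in.
    # Ranks are 1-based; a document absent from a list adds nothing for it.
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


bm25_list = ["P", "Q", "R"]   # ranked list A
dense_list = ["Q", "R", "P"]  # ranked list B

fused = rrf_fuse([bm25_list, dense_list])
fused_order = [doc_id for doc_id, _ in fused]
# Q: 1/62 + 1/61, P: 1/61 + 1/63, R: 1/63 + 1/62, so the order is Q, P, R.
```

Q comes out first because both retrievers place it in their top two, while P's first place in one list is offset by its third place in the other. No raw BM25 score or cosine similarity enters the calculation, so the same function works for any pair of retrievers.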

## The decision

| Your corpus and queries                                 | Retrieval to run                                    |
| ------------------------------------------------------- | --------------------------------------------------- |
| In-domain, paraphrase-heavy, few exact identifiers      | Dense can lead; keep BM25 as a cheap safety net     |
| Rare exact terms — codes, part numbers, names, statutes | BM25 must be in the mix; dense alone will miss them |
| Out-of-domain vs the embedding model's training data    | Hybrid; do not trust dense alone                    |
| Unsure                                                  | Hybrid with RRF — the low-regret default            |

"Hybrid" is the safe answer here, but not out of indecision — it is a positive choice with three reasons behind it. The two retrievers fail on disjoint inputs, so running both genuinely covers more queries rather than buying two views of the same coverage. Fusing them with RRF is cheap and needs no tuning, so the second retriever adds little operational weight. And the asymmetry of the costs is decisive: the price of a second index is a modest, fixed amount of storage and compute, while the price of getting the choice wrong — shipping dense-only onto a corpus full of exact tokens — is silently missing every query that turns on an identifier, a failure that does not announce itself and surfaces only as a slow trickle of support tickets. A small certain cost against a large hidden one is not a close call. Measure it on your own corpus, the way [the chunking essay](/blog/chunking-not-rag/) insists — run real queries against each retriever and against the fusion — but expect the measurement to tell you to keep both.

- [ ] Checked whether your corpus carries meaning in exact tokens — codes, IDs, names, units.
- [ ] Tested retrieval out-of-domain, not only on a corpus resembling the embedding model's training data.
- [ ] Kept a BM25 index running alongside the dense one rather than retiring it.
- [ ] Fused the two with RRF; reranked the fused list, not one retriever's output.
- [ ] Measured recall on your own queries before trusting either retriever alone.
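The last item on the checklist can start as something this small. The sketch below assumes you have a set of real queries with hand-labelled relevant document IDs and the ranked output of each retriever for those queries. Every ID and list in it is invented to show the mechanics, not a measurement, and the names `recall_at_k`, `mean_recall`, `qrels`, and `runs` are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant documents that appear in the top k results.
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def mean_recall(run, qrels, k):
    # Average recall@k over all labelled queries; a missing query counts as zero.
    return sum(recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()) / len(qrels)


# Invented relevance labels: query ID -> set of relevant document IDs.
qrels = {"q1": {"d1"}, "q2": {"d7"}}

# Invented ranked outputs per retriever: query ID -> document IDs, best first.
runs = {
    "bm25": {"q1": ["d1", "d3"], "q2": ["d2", "d4"]},
    "dense": {"q1": ["d5", "d3"], "q2": ["d7", "d4"]},
    "hybrid": {"q1": ["d1", "d5"], "q2": ["d7", "d2"]},
}

report = {name: mean_recall(run, qrels, k=2) for name, run in runs.items()}
```

Run the same comparison with your own queries split by type, identifier lookups in one bucket and paraphrased questions in the other, since an average over both can hide a retriever that fails one bucket completely.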

## Reading list

- BEIR — the heterogeneous retrieval benchmark; "BM25 is a robust baseline," dense often underperforms out-of-domain: [arXiv 2104.08663](https://arxiv.org/abs/2104.08663)
- Dense Passage Retrieval — the in-domain dense win, 9–19 points over BM25: [arXiv 2004.04906](https://arxiv.org/abs/2004.04906)
- In Defense of Cross-Encoders — bi-encoders give "no gains" over BM25 out-of-domain: [arXiv 2212.06121](https://arxiv.org/abs/2212.06121)
- Contriever — a strong unsupervised dense retriever that still loses to BM25 on specialised corpora: [arXiv 2112.09118](https://arxiv.org/abs/2112.09118)
- SPLADE v2 — learned sparse retrieval; exact-match efficiency with learned expansion: [arXiv 2109.10086](https://arxiv.org/abs/2109.10086)
- Anthropic — contextual retrieval; the 35%-to-49% failure-reduction step from adding BM25: [anthropic.com](https://www.anthropic.com/news/contextual-retrieval)
- Reciprocal Rank Fusion — the standard, tuning-free way to combine two ranked lists: [Cormack et al., SIGIR 2009](https://cormack.uwaterloo.ca/cormacksigir09-rrf.pdf)

The embedding model is not better than BM25. It is better at a different half of the queries — and a production system does not get to answer only half.