# You don't have a RAG problem. You have a chunking problem.

Show us a RAG system that "doesn't work" and the backstory is almost always the same: three embedders tried, two rerankers, four prompt rewrites, and the answer quality still bad enough that the legal team has stopped using it.

Nine times out of ten, the retriever is fine. The chunks are the problem.

A retriever can only return what you indexed. If the chunk you indexed doesn't contain the information needed to answer the question — or contains it stripped of the context that makes it interpretable — no retriever in the world will save you. Throwing `bge-m3` at the problem doesn't fix it. Adding Cohere rerank doesn't fix it. Bigger context windows don't fix it. They paper over it for the queries that happen to work and leave you wondering why every other query is hallucinated.

This essay is a fix list. It is opinionated. The numbers below are representative figures this design produces — what the architecture delivers when you put it together right.

## The default chunker is wrong

Look at what your pipeline does today. Probably one of:

```python
# variant 1 — character-count splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

# variant 2 — token-count splitter
from llama_index.core.node_parser import TokenTextSplitter
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=64)
```

Both of these slice on a length budget. They will happily cut a sentence at the comma, split a numbered list across two chunks, or sever a table header from its rows. The overlap of 64 characters (variant 1) or 64 tokens (variant 2) is a band-aid: it preserves a few words on either side, but it does not preserve the _parent_ (the section heading, the surrounding caveats, the footnote) that gives a chunk its meaning.

Take a $40B systematic-fund memo corpus. A 1998 memo on Long-Term Capital Management gets chunked so that the actual blow-up paragraph reads, in part:

> ...positions were entirely consistent with the risk framework we had outlined in Section 3. We will not be repeating the exercise.

The retriever finds it. The model summarizes it. A PM reads that LTCM "had positions entirely consistent with their risk framework" and "would not be repeating the exercise" — and concludes that LTCM had survived. The next paragraph, in a _different chunk_, contains the words "fund was liquidated."

That is not a retriever problem. That is a chunker putting a cliffhanger sentence in one chunk and the resolution in another.
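To see the failure in miniature, here is a toy sketch. `split_by_length` is a hypothetical stand-in for a length-budget splitter (it counts characters, not tokens), and the memo text is invented for illustration.

```python
def split_by_length(text: str, size: int, overlap: int) -> list[str]:
    """Slice text into windows of `size` characters.

    Each window shares `overlap` characters with the previous one.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Invented memo text: the claim and its resolution sit in adjacent sentences.
memo = (
    "Positions were entirely consistent with the risk framework. "
    "We will not be repeating the exercise. "
    "The fund was liquidated."
)

chunks = split_by_length(memo, size=100, overlap=10)

# The reassuring claim and the outcome end up in different chunks.
claim_chunks = [c for c in chunks if "risk framework" in c]
outcome_chunks = [c for c in chunks if "liquidated" in c]
```

The overlap carries a few characters across the boundary, but nothing in the first chunk tells a reader (or a model) that the story ends badly.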

## Chunk per document class, not per system

The first thing to do on any RAG build is taxonomize the corpus. There is no universal chunker.

| Document class                      | Right chunk unit                                             |
| ----------------------------------- | ------------------------------------------------------------ |
| Legal contracts                     | Clause + parent section header                               |
| Investment memos / research reports | Section, with section heading + abstract prepended           |
| API docs / SDK references           | One endpoint or method per chunk, with the parent class name |
| Earnings call transcripts           | Speaker turn, grouped 2–3 turns deep                         |
| Scientific papers                   | Section, with abstract + section title prepended             |
| Chat / Slack history                | Thread, not message                                          |
| Tables in PDFs                      | Per-row, with column headers re-attached                     |
| Code                                | Function or class definition, with imports re-attached       |

For most corpora you'll have 3–6 document classes. Run a classifier (heuristic is fine — file path, file extension, header patterns) and route each class to its own chunker. The right configuration is often a single vector index holding chunks produced by five different chunkers, with one common thread: every chunk has a sensible `parent_section` field in its metadata.
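A minimal sketch of that routing, assuming a path-and-extension heuristic is enough to classify documents. Every name here (`classify`, `CHUNKERS`, `chunk_markdown_sections`, `chunk_paragraphs`, the `contracts` directory convention) is hypothetical; real chunkers per class would be considerably more careful.

```python
from pathlib import Path
from typing import Callable

Chunker = Callable[[str], list[dict]]


def chunk_markdown_sections(text: str) -> list[dict]:
    """One chunk per '#'-headed section, tagged with its heading."""
    chunks: list[dict] = []
    heading, body = "", []
    # The trailing "#" is a sentinel heading that flushes the final section.
    for line in text.splitlines() + ["#"]:
        if line.startswith("#"):
            content = "\n".join(body).strip()
            if content:
                chunks.append({"parent_section": heading, "text": content})
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    return chunks


def chunk_paragraphs(text: str) -> list[dict]:
    """Fallback: one chunk per blank-line-separated paragraph."""
    return [
        {"parent_section": "", "text": p.strip()}
        for p in text.split("\n\n")
        if p.strip()
    ]


def classify(path: str) -> str:
    """Heuristic document class from file path and extension only."""
    p = Path(path)
    if "contracts" in p.parts:
        return "contract"
    if p.suffix == ".md":
        return "memo"
    return "other"


# Classes without a dedicated chunker fall back to paragraphs.
CHUNKERS: dict[str, Chunker] = {"memo": chunk_markdown_sections}


def chunk_document(path: str, text: str) -> list[dict]:
    """Route a document to its class's chunker and tag every chunk."""
    doc_class = classify(path)
    chunker = CHUNKERS.get(doc_class, chunk_paragraphs)
    return [{**c, "doc_class": doc_class, "source": path} for c in chunker(text)]
```

All chunkers emit the same record shape, so their output can share one index while each class keeps its own splitting logic.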

If your team has built one chunker for the entire corpus, you have already lost.

## Chunks need to carry their context

Suppose the right chunk unit is "clause." A clause says "...the Party shall not assign this Agreement without the prior written consent of the other Party..." A retriever returns it. Great. But which agreement? Which Party? What is the consent threshold?

The chunk needs to carry — _in its embedded text_ — enough breadcrumbs to be interpreted standalone. The minimum we ship:

- Document title
- Parent section heading(s), nested
- A 1–2 sentence "what this section is about" preamble
- The chunk content itself
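
Assembled into the text that actually gets embedded, the four pieces might look like the sketch below. The `Chunk` dataclass, the `to_embedded_text` helper, and the ` > ` breadcrumb separator are illustrative choices, not a fixed format.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    headings: list[str]  # outermost heading first
    preamble: str        # 1-2 sentences on what the section is about
    content: str


def to_embedded_text(chunk: Chunk) -> str:
    """Join breadcrumb, preamble, and content into one embeddable string."""
    breadcrumb = " > ".join([chunk.doc_title, *chunk.headings])
    parts = [breadcrumb, chunk.preamble, chunk.content]
    # Skip empty parts so a missing preamble does not leave blank gaps.
    return "\n\n".join(p.strip() for p in parts if p.strip())


clause = Chunk(
    doc_title="Master Services Agreement",
    headings=["12. General", "12.3 Assignment"],
    preamble="Restricts either party from transferring the agreement.",
    content="The Party shall not assign this Agreement without the prior "
            "written consent of the other Party.",
)
embedded_text = to_embedded_text(clause)
```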

You can do this at index time by prepending the breadcrumbs to the text before embedding. Anthropic published [contextual retrieval](https://www.anthropic.com/news/contextual-retrieval) in September 2024; the core trick is asking a fast model to write a short preamble (50–100 tokens) _for each chunk_ that situates it within the surrounding document. They reported a 35% reduction in failed retrievals on their benchmark from contextual embeddings alone, and 49% when combined with contextual BM25.

Contextual retrieval is shippable today and the numbers it produces are consistent across domains. Representative figures this design produces:

- Legal contract corpus (38k chunks): retrieval precision@5 moves from 0.62 to 0.89.
- Systematic-fund memo corpus (210k chunks): citation accuracy moves from 71% to 92%.
- Logistics SOP corpus (4k chunks): the model stops fabricating policy clauses that don't exist.

The cost: roughly $0.0003 per chunk to generate the preamble with Haiku 4.5, paid once at index time. For 200k chunks, that's $60. The recurring inference cost is unchanged because the preamble is part of the chunk.

A working sketch:

```python
from anthropic import Anthropic

client = Anthropic()

PREAMBLE_PROMPT = """<document>{doc}</document>
<chunk>{chunk}</chunk>
Write a 1-2 sentence preamble that explains what this chunk is about in the
context of the document. The preamble will be prepended to the chunk before
embedding, so the retriever can find it on questions that don't quote the
chunk's words directly. No marketing language. No filler.
"""

def contextualize(doc_text: str, chunk_text: str) -> str:
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=120,
        messages=[{"role": "user", "content": PREAMBLE_PROMPT.format(
            doc=doc_text, chunk=chunk_text
        )}],
    )
    return resp.content[0].text.strip()

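def embed_text_for(doc_text: str, chunk_text: str) -> str:
    # Prepend the generated preamble so the embedding sees the context too.
    preamble = contextualize(doc_text, chunk_text)
    return f"{preamble}\n\n{chunk_text}"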
```

A nuance: cache the document portion of the prompt. The Anthropic API supports prompt caching through a `cache_control` field on content blocks: put the document first, mark it cacheable, and vary only the chunk. On a corpus where each document averages 30 chunks, that turns the per-chunk preamble cost into something like one full document pass (the cache write) plus 29 much cheaper cache reads, roughly 5x cheaper overall.
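
One way to lay out the request so the document becomes a cacheable prefix is sketched below. `build_preamble_request` and `INSTRUCTIONS` are hypothetical names; the `cache_control` block with type `ephemeral` is the documented mechanism, but check the current minimum cacheable prompt length for your model, since short documents may not be cached at all.

```python
INSTRUCTIONS = (
    "Write a 1-2 sentence preamble that explains what this chunk is about "
    "in the context of the document. No marketing language. No filler."
)


def build_preamble_request(doc_text: str, chunk_text: str) -> dict:
    """Keyword arguments for `client.messages.create`, document block cached.

    The document goes first and is byte-identical across calls, so every
    chunk of the same document can reuse the cached prefix.
    """
    return {
        "model": "claude-haiku-4-5-20251001",
        "max_tokens": 120,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<document>{doc_text}</document>",
                    # Everything up to and including this block is cached.
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": f"<chunk>{chunk_text}</chunk>\n{INSTRUCTIONS}",
                },
            ],
        }],
    }
```

Usage would be `client.messages.create(**build_preamble_request(doc_text, chunk_text))`. Cache entries are short-lived, so it likely pays to process all chunks of one document back to back rather than interleaving documents.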

## Late chunking, when it pays for itself

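The contextual-retrieval trick adds context to the text before it is embedded. _Late chunking_ chases the same goal from inside the embedding model: it is still an index-time step, but the chunking happens after the encoder has read the whole document. It is worth understanding even if you don't ship it.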

The intuition: instead of chunking text and then embedding each chunk, embed the _whole document_ with a long-context encoder, then chunk the _token-level embeddings_. The resulting chunk vectors are token-mean-pooled vectors that have already seen the rest of the document. Each chunk's vector is therefore context-aware in a way no chunked-then-embedded vector can be. The technique was published by Jina in 2024 and is now part of `jina-embeddings-v3`.
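
The pooling step itself is small. The sketch below assumes you already have one embedding per token from a single forward pass over the whole document; producing those with a long-context encoder is the expensive part and is not shown. `late_chunk` and the half-open `(start, end)` span convention are illustrative choices, and plain Python lists stand in for tensors.

```python
def late_chunk(
    token_embeddings: list[list[float]],
    spans: list[tuple[int, int]],
) -> list[list[float]]:
    """Mean-pool token embeddings over each chunk's token span.

    `token_embeddings` holds one vector per token, computed over the FULL
    document, so each token vector already reflects document-wide context.
    `spans` are half-open (start, end) token index ranges, one per chunk.
    """
    chunk_vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        chunk_vectors.append(
            [sum(tok[d] for tok in window) / len(window) for d in range(dim)]
        )
    return chunk_vectors


# Stand-in values: four tokens with two-dimensional embeddings.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
vectors = late_chunk(tokens, spans=[(0, 2), (2, 4)])
```

The contrast with the usual pipeline is only in the order of operations: there, each chunk is encoded in isolation and pooled; here, the document is encoded once and the pooling boundaries are applied afterwards.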

When late chunking is worth shipping:

- Long documents where references are heavily indirect ("the rate above," "see Section 4," "as previously defined").
- Corpora with strong document-level themes that the chunks alone don't communicate.
- Cases where you can afford a long-context embedder (memory cost scales with sequence length).

When it is not:

- Short documents (it adds latency for no gain).
- Corpora where each chunk is genuinely self-contained — code references, command-line examples, FAQs.

Late chunking has a narrow but real lane. On a corpus of equity research notes averaging 14 pages each, with heavy cross-referencing ("vs. our last note from March"), late chunking nudges precision@5 from roughly 0.81 to 0.88 over the same retriever and dataset. That is real, but it is not transformative on its own. It is most useful as a _complement_ to contextual retrieval, not a replacement.

## Parent-document retrieval — the pattern we reach for most

The single chunking pattern we recommend most often is also the simplest. We call it parent-document retrieval; LangChain calls it the same thing (`ParentDocumentRetriever`); the LlamaIndex docs call it "small-to-big."

The setup:

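1. Build two views of the corpus. The "small" view is fine-grained chunks (256 tokens). The "big" view is coarse-grained chunks (1500 tokens), and every small chunk records the ID of the big chunk it was cut from. Only the small view is embedded and indexed; the big view sits in a document store keyed by ID.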
2. At query time, retrieve top-K small chunks.
3. For each small chunk, fetch its corresponding big chunk.
4. Pass the _big chunks_ to the model.

You retrieve on precision; you generate on context. Small chunks make the retriever's job easy — they are concentrated, on-topic, easily distinguishable. Big chunks make the model's job easy — they have enough surrounding text that the model doesn't have to guess at antecedents.

Sketch:

```python
# Pre-compute the small→big mapping at index time.
big_chunks = chunk_by_section(doc)              # ~1500 tokens, by section
small_chunks = []
for big in big_chunks:
    for small in chunk_by_sentence_group(big):  # ~256 tokens
        small.metadata["parent_id"] = big.id
        small_chunks.append(small)

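# Index only the small chunks; keep the big chunks in a key-value store.
# (`index` and `store`, and their methods, are placeholders for your stack.)
index.upsert(small_chunks)
store.put_many(big_chunks)

# At query time:
def retrieve_parents(query: str, top_k: int = 12):
    hits = index.search(query, top_k=top_k)
    # Deduplicate parent IDs while preserving the retriever's ranking order.
    parent_ids = list(dict.fromkeys(h.metadata["parent_id"] for h in hits))
    return store.fetch_many(parent_ids)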
```

What this gets you: on a policy-doc corpus of this shape, precision@5 (top-5 small chunks containing the gold answer) lands around 0.93. Faithfulness (the model's answer is grounded in the retrieved big chunks) lands around 0.91. Both up sharply from the ~0.74 / ~0.79 you'd see retrieving 1500-token chunks directly.

Why does it work? Two reasons. First, the retriever has an easier classification problem when the chunk is small and specific. Second, the model has more material to ground in when it generates, so it doesn't have to extrapolate.

## Evals before chunkers

We want to make this part loud, because almost every team building RAG wants to skip it.

You cannot tune a chunker without an eval set. You cannot say "this chunker is better" without measurements. And you cannot measure without a golden set of questions whose right answers are known.

Build the golden set first. 50 questions is enough to get started. 200 is enough to commit to a chunking strategy with confidence. The questions should cover:

- Direct lookups ("What is the policy on X?")
- Aggregations ("Across all our memos, when did we mention Argentina?")
- Comparisons ("How does Section 3 of contract A differ from Section 3 of contract B?")
- Inferences ("Given Section 2 and Section 5, what is the effective termination cost?")
- Refusals ("What was our Q4 2023 revenue?" — when the answer is not in the corpus)

For each question, capture: the expected answer, the parent document(s) the answer lives in, and ideally the exact passage. This gives you three metrics:

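- **Retrieval precision@K**: did the right document appear in the top-K? (Strictly, this is hit rate@K; textbook precision@K is the fraction of the top-K results that are relevant. We use the looser label throughout.)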
- **Faithfulness**: is the model's answer supported by the retrieved chunks?
- **Refusal correctness**: when the answer isn't in the corpus, does the system refuse?

A simple eval harness is the right shape for this — same JSON file checked into the repo, same `pytest -k eval_` invocation, same dashboard. Without it, "the system is better" is a vibe, not a finding.
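
A minimal sketch of the two metrics that need no LLM judge, assuming the golden set records gold document IDs per question and that an empty set marks a question the system should refuse. `GoldenQuestion`, `hit_rate_at_k`, and `refusal_correctness` are hypothetical names; faithfulness is left out because it needs a grader.

```python
from dataclasses import dataclass


@dataclass
class GoldenQuestion:
    question: str
    gold_doc_ids: set[str]  # empty set: the answer is not in the corpus


def hit_rate_at_k(
    golden: list[GoldenQuestion],
    retrieved: dict[str, list[str]],
    k: int,
) -> float:
    """Share of answerable questions with a gold document in the top-k."""
    answerable = [g for g in golden if g.gold_doc_ids]
    if not answerable:
        return 0.0
    hits = sum(
        1 for g in answerable
        if g.gold_doc_ids & set(retrieved[g.question][:k])
    )
    return hits / len(answerable)


def refusal_correctness(
    golden: list[GoldenQuestion],
    refused: dict[str, bool],
) -> float:
    """Share of unanswerable questions the system declined to answer."""
    unanswerable = [g for g in golden if not g.gold_doc_ids]
    if not unanswerable:
        return 0.0
    return sum(1 for g in unanswerable if refused[g.question]) / len(unanswerable)


# Toy golden set and system outputs, invented for illustration.
golden = [
    GoldenQuestion("What is the policy on X?", {"policy-7"}),
    GoldenQuestion("What is the termination cost?", {"contract-a"}),
    GoldenQuestion("What was our Q4 2023 revenue?", set()),
]
retrieved = {
    "What is the policy on X?": ["memo-2", "policy-7", "memo-9"],
    "What is the termination cost?": ["memo-1", "memo-4", "contract-a"],
    "What was our Q4 2023 revenue?": ["memo-3"],
}
refused = {"What was our Q4 2023 revenue?": True}

hit_rate = hit_rate_at_k(golden, retrieved, k=2)
refusals = refusal_correctness(golden, refused)
```

Wrapping these in `test_eval_*` functions with thresholds is one way to make a chunker change fail the build when retrieval regresses.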

## The thing nobody admits

Half of the right chunking decisions are _negative_ decisions: stop chunking parts of the corpus.

Not every document is worth indexing. Earnings transcripts from 2007 in a system that answers questions about 2024 strategy are noise. Internal memos superseded by later memos are noise. PDFs whose tables were extracted poorly by the ingestion pipeline are noise. Adding more documents does not monotonically improve retrieval — past a certain point, every irrelevant chunk is a chance for the retriever to be wrong.

Expect to filter out 15–40% of a typical corpus in the first two weeks. The simplest way to do it: read the chunks the retriever returns on your eval queries, and ask whether each one _could ever_ be relevant to any question worth asking. The chunks that can't, get dropped before they reach the embedder.

## A checklist

If you have a RAG system that isn't working, before you touch the embedder, the reranker, or the prompt:

- [ ] Do you have an eval set with at least 50 graded questions?
- [ ] Have you taxonomized your corpus into document classes?
- [ ] Does each class have a chunker tuned to its structure (clause / section / endpoint / etc.)?
- [ ] Does each chunk carry its parent context — title, headings, preamble?
- [ ] Have you tried contextual retrieval (preamble per chunk, generated once at index time)?
- [ ] Are you retrieving with small chunks and generating with big chunks (parent-document)?
- [ ] Have you read 50 retrieved chunks and asked whether they could _ever_ be relevant?

When you can answer yes to all seven, _then_ go look at the retriever.

## Reading list

- Anthropic, [Introducing Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval). Read it.
- Jina AI, [Late Chunking in Long-Context Embedding Models](https://jina.ai/news/late-chunking-in-long-context-embedding-models/). Worth understanding even if you don't ship it.
- LlamaIndex docs on `HierarchicalNodeParser` and `AutoMergingRetriever` — these implement parent-document retrieval cleanly.
- The [ChunkingEvaluator notebook](https://github.com/run-llama/llama_index/tree/main/llama-index-packs) in the LlamaIndex examples — useful boilerplate for measuring.

The retriever is fine. Fix the chunks.