# Stop OCR-ing your PDFs: retrieve on the page, not the transcript.

A 10-K filing arrives as a PDF. Three hundred pages, dense with tables — segment revenue, a debt maturity schedule, a five-year selected-financial-data block — plus a handful of charts and a few pages of two-column risk-factor prose. Someone wants a RAG system over a thousand of these.

The pipeline almost everyone reaches for: run the PDF through OCR, reconstruct the reading order, chunk the resulting text, embed the chunks. It will work for the prose. It will quietly destroy everything else. And the part it destroys is the part the filing exists to communicate.

This essay is about the alternative — embedding the page image directly, with no OCR step at all — and the narrower question of when that is the right call.

## The OCR tax

A PDF→text→chunk pipeline is three lossy transforms stacked on top of each other, and the loss compounds.

The first transform is OCR itself. Modern OCR is good on clean body text and far less good on everything else: small-print footnotes, rotated table headers, figures with embedded labels, low-contrast scans, stamps, handwriting. Every character it gets wrong is now in your index as fact, and no downstream step recovers it — the embedder faithfully embeds the error, the retriever matches on it, the generator cites it.

The second transform is layout reconstruction. Unless it is tagged, a PDF has no notion of "reading order"; it is a bag of positioned glyphs. To produce a linear text stream, the pipeline has to guess which glyphs belong to which column, which block is a heading, where a table starts and stops. On a two-column risk-factors page, a guess that interleaves the columns produces text where every other sentence is from a different argument. The words are all correct. The meaning is gone.
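To make the interleaving failure concrete, here is a minimal sketch. The `Line` type, the coordinates, and the sentences are all invented for illustration; a real extractor works on glyph boxes and has to guess the gutter position, which this sketch simply hands in as `gutter_x`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    text: str
    x: float  # left edge of the line, in points (hypothetical values)
    y: float  # distance from the top of the page, in points


# A made-up two-column page: each column carries its own argument.
PAGE = [
    Line("Rates may rise,", x=50, y=100),
    Line("raising our interest cost.", x=50, y=115),
    Line("A supplier may fail,", x=320, y=100),
    Line("halting production.", x=320, y=115),
]


def naive_order(lines):
    """Top-to-bottom, then left-to-right: ignores columns entirely."""
    return [ln.text for ln in sorted(lines, key=lambda ln: (ln.y, ln.x))]


def column_aware_order(lines, gutter_x):
    """Read the whole left column, then the whole right column."""
    left = sorted((ln for ln in lines if ln.x < gutter_x), key=lambda ln: ln.y)
    right = sorted((ln for ln in lines if ln.x >= gutter_x), key=lambda ln: ln.y)
    return [ln.text for ln in left + right]


naive = " ".join(naive_order(PAGE))
aware = " ".join(column_aware_order(PAGE, gutter_x=200))
print(naive)
print(aware)
```

Both outputs contain exactly the same words. Only the second keeps each risk factor intact, and it only does so because the gutter was given rather than guessed.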

The third transform is the one [the chunking essay](/blog/chunking-not-rag/) is entirely about — slicing the reconstructed text on a length budget. But notice what has already happened before the chunker runs. A table has been flattened into a stream of numbers with the column headers detached, so "412" no longer sits under "FY2024" next to "Net revenue." A chart has been reduced to its axis labels, or dropped entirely, because a bar chart has almost no extractable text and what little it has is meaningless once linearized. The chunker then cuts that damaged stream further.
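A small sketch of that detachment, under loud assumptions: the table is invented (only the "412" under "FY2024" next to "Net revenue" comes from the example above), the extractor is modeled as a plain row-by-row flatten, and the chunker is a fixed token budget with no structure awareness.

```python
# Hypothetical table; every figure except the 412 cell is made up.
HEADER = ["Metric", "FY2022", "FY2023", "FY2024"]
ROWS = [
    ["Net revenue", "371", "390", "412"],
    ["Operating income", "52", "58", "66"],
]


def flatten(header, rows):
    """Linearize a table row by row, roughly as a text extractor would."""
    tokens = []
    for row in [header] + rows:
        for cell in row:
            tokens.extend(cell.split())
    return tokens


def chunk(tokens, budget):
    """Slice on a fixed token budget with no regard for structure."""
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]


tokens = flatten(HEADER, ROWS)
chunks = chunk(tokens, budget=6)

# Find the chunk that ended up holding the value we care about.
holder = next(c for c in chunks if "412" in c)
print(holder)
print("has column header:", "FY2024" in holder)
print("has row label:", "revenue" in holder)
```

The chunk that holds `412` carries neither its column header nor its row label, so nothing in that chunk says what the number means.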

Stack the three and the failure is structural. A question like "what was the segment operating margin in the most recent year" depends on a row, a column, and a header intersecting on the page. Flatten the page to text and that intersection is the first thing to go. The retriever was never the problem. The corpus was destroyed before the retriever saw it.

## What ColPali does

ColPali — published as [arXiv 2407.01449](https://arxiv.org/abs/2407.01449), "ColPali: Efficient Document Retrieval with Vision Language Models" — removes the OCR step entirely. It does not turn the page into text. It treats the page as an image and embeds the image.

The mechanism is worth being precise about, because "embed the image" undersells it. A naive image retriever encodes a whole page into one dense vector, the way CLIP encodes a photo. That single vector is too coarse for documents — a page has dozens of distinct regions, and one vector cannot represent a query that needs one specific table cell among them.

ColPali borrows its matching scheme from ColBERT. Instead of one vector per page, the vision-language model cuts the page into a grid and emits a vector per image patch. The query is likewise embedded as one vector per token. Relevance is then computed by **late interaction**: for each query-token vector, take its best-matching patch on the page (a max over patches), then sum those per-token maxima. This is the ColBERT MaxSim operator, applied across image patches rather than text tokens.

```
Query "segment operating margin 2024"
  -> tokens: [segment] [operating] [margin] [2024]
  -> one query vector per token

Page (one PDF page, no OCR)
  -> vision-language model
  -> grid of image patches
  -> one document vector per patch

Score(query, page) = sum over query tokens of
                       max over page patches of
                         dot(query_token_vec, patch_vec)
```

The consequence: the token `2024` can light up the patch covering a column header, `margin` can light up a patch over a row label, and the page scores high because the query's pieces each found a home somewhere on it — even though no contiguous text string on that page reads "segment operating margin 2024." The model is matching on what the page _looks like_. A table looks like a table to it. A chart looks like a chart. ColQwen2, a model in the ColPali family built on a Qwen2-VL backbone, is the variant we most often reach for as a current, well-supported encoder.
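The scoring rule can be sketched in a few lines of plain Python. The vectors below are made-up two-dimensional toys, not model output, and `pooled_score` is a deliberately crude stand-in for a single-vector retriever (it averages the vectors), included only to show the kind of signal that one vector per page can lose.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, patch_vecs):
    """Late interaction: best-matching patch per query token, summed over tokens."""
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)


def mean_vec(vecs):
    """Collapse many vectors into one by averaging (crude single-vector baseline)."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]


def pooled_score(query_vecs, patch_vecs):
    """Single-vector scoring: averaged query vector against averaged page vector."""
    return dot(mean_vec(query_vecs), mean_vec(patch_vecs))


# Made-up 2-d vectors for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]               # two query tokens
page_a = [[0.9, 0.1], [0.1, 0.8], [0.0, 0.0]]  # one strong patch per token, plus blank space
page_b = [[0.5, 0.5], [0.4, 0.4]]              # uniformly lukewarm patches

for name, page in [("page_a", page_a), ("page_b", page_b)]:
    print(name,
          "maxsim:", round(maxsim_score(query, page), 3),
          "pooled:", round(pooled_score(query, page), 3))
```

On this toy data MaxSim ranks `page_a` first, because each query token finds a strongly matching patch, while the pooled score ranks `page_b` first, because averaging dilutes `page_a`'s two good patches with its blank one.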

Crucially, indexing is now a single step. There is no OCR engine, no layout-reconstruction heuristic, no chunker. You render each page to an image and run one model. The entire fragile middle of the classic pipeline is gone, and with it every error that middle introduced.

## When it wins

Visual retrieval earns its cost on exactly the corpora where the OCR pipeline loses the most: documents whose information lives in their layout.

- **Financial filings.** 10-Ks, 10-Qs, prospectuses, fund fact sheets. Dense tables, footnoted numbers, the occasional chart. The data is the layout.
- **Scientific papers with figures.** A question about a result frequently resolves to a figure or a results table, not to a sentence. OCR drops the figure; visual retrieval keeps it.
- **Forms.** Tax forms, claims forms, intake forms. The meaning of a value is its position in the form. A flattened form is a list of orphaned strings.
- **Scanned contracts and historical documents.** Anything where OCR quality is already shaky — old scans, faxes, stamped or annotated pages. Visual retrieval never runs OCR, so it never inherits OCR's error rate.

There is a second, quieter advantage: scanned documents stop being a special case. A born-digital PDF and a photographed page are both just images to a visual retriever, where the classic pipeline needs a high-quality OCR pass — its least reliable component — to touch the scan at all. For a layout-heavy corpus, the gap is not a tuning detail. It is the difference between a system that can answer table questions and one that structurally cannot.

## When it doesn't

Visual retrieval is not a free upgrade, and on the wrong corpus it is a straight downgrade.

If your documents are **clean, born-digital text** (source code, API documentation, Markdown, plain prose with no meaningful layout), there is nothing for the visual model to recover. Born-digital text needs no OCR at all: the characters can be read straight from the source, so there is no recognition error to avoid. Layout carries no signal a heading-aware text chunker doesn't already capture. You would be paying the full cost of image embeddings to retrieve information that a `bge`-class text embedder indexes faster, cheaper, and at least as accurately. For a code corpus, text retrieval is not the compromise. It is the right answer.

The costs you take on are three, and the next section prices each in detail: a vision-language model permanently in both the indexing and query paths, a multi-vector index that is far larger than the equivalent text index, and indexing throughput slow enough to make a re-index a real operation. None of this rules visual retrieval out. It does mean the decision is a genuine trade-off — index size, GPU spend, and indexing latency, bought in exchange for surviving layout — and one you only want to make where layout is actually load-bearing.

## Cost and infra

If you adopt visual retrieval, three line items dominate, and none of them is the query-time LLM.

The first is **index size**, and it is the one that surprises teams. One vector per patch, on the order of a thousand patches per page, across hundreds of thousands of pages, is a multi-vector index whose footprint is not comparable to the text index you may be mentally pricing against. Compression helps — patch vectors quantize aggressively, and a coarse single-vector pre-filter can shortlist pages before late interaction runs on the survivors — but design for it from the start rather than discover it in week three.
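A back-of-envelope estimate makes the gap visible. The page count is the thousand 300-page filings from the opening; the 128-dimension patch vectors, the 1024-dimension text vector, the precisions, and the "one text vector per page" simplification are all assumptions for illustration, so substitute your own encoder's numbers.

```python
def index_bytes(pages, vectors_per_page, dim, bits_per_value):
    """Raw vector storage in bytes, ignoring index overhead and metadata."""
    return pages * vectors_per_page * dim * bits_per_value // 8


def gb(n_bytes):
    """Decimal gigabytes, rounded for display."""
    return round(n_bytes / 1e9, 2)


PAGES = 1_000 * 300       # a thousand filings at three hundred pages each
PATCHES_PER_PAGE = 1_000  # "on the order of a thousand" (assumed round number)

# Assumed: 128-dim patch vectors stored as 16-bit floats.
multi_fp16 = index_bytes(PAGES, PATCHES_PER_PAGE, dim=128, bits_per_value=16)
# Assumed: the same vectors quantized to 1 bit per dimension.
multi_binary = index_bytes(PAGES, PATCHES_PER_PAGE, dim=128, bits_per_value=1)
# Assumed baseline: one 1024-dim float32 text vector per page.
single_fp32 = index_bytes(PAGES, 1, dim=1024, bits_per_value=32)

print("multi-vector, fp16:   ", gb(multi_fp16), "GB")
print("multi-vector, 1-bit:  ", gb(multi_binary), "GB")
print("single-vector, fp32:  ", gb(single_fp32), "GB")
print("fp16 multi / single:  ", round(multi_fp16 / single_fp32, 1), "x")
```

Under these assumptions the unquantized multi-vector index is tens of times larger than the single-vector baseline, and even the aggressively quantized version is still several times larger. A real text index usually stores several chunks per page, which narrows the ratio but does not close it.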

The second is the **GPU encoder**, which is not a one-time indexing cost. It stays in the query path forever, because every incoming query is embedded by the same VLM. Size it for steady-state query load, not just the initial backfill.

The third is **indexing throughput**, which sets how long a re-index takes. On a corpus that grows or revises — filings restated, contracts amended — re-indexing is recurring, and a slow encoder makes it a recurring cost.
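The same kind of arithmetic prices a re-index. The throughput figure below is a placeholder, not a benchmark of any encoder, and the 5% revision rate is invented; measure pages per second on your own hardware before trusting any number that comes out of this.

```python
def encode_hours(pages, pages_per_second_per_gpu, gpus=1):
    """Wall-clock hours to encode `pages`, assuming perfectly parallel GPUs."""
    if pages_per_second_per_gpu <= 0 or gpus < 1:
        raise ValueError("throughput must be positive and gpus must be >= 1")
    return pages / (pages_per_second_per_gpu * gpus) / 3600


PAGES = 300_000          # a thousand filings at three hundred pages each
RATE = 2.0               # PLACEHOLDER pages/second/GPU, not a measured figure
REVISED = PAGES // 20    # hypothetical: 5% of pages restated or amended

full_one_gpu = encode_hours(PAGES, RATE, gpus=1)
full_four_gpus = encode_hours(PAGES, RATE, gpus=4)
incremental = encode_hours(REVISED, RATE, gpus=1)

print("full re-index, 1 GPU: ", round(full_one_gpu, 1), "hours")
print("full re-index, 4 GPUs:", round(full_four_gpus, 1), "hours")
print("5% incremental, 1 GPU:", round(incremental, 1), "hours")
```

The useful output is the shape, not the values: a full re-index scales with corpus size, so a pipeline that can re-encode only the revised pages turns a recurring multi-day job into a recurring short one.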

This is also where production tooling matters, because a multi-vector index plus a VLM encoder plus a query path is real systems work, not a one-file script. Tools such as **Mixpeek** and **astra-multivector** exist precisely to carry the multi-vector indexing and serving layer ColPali-class retrieval requires; reaching for one is usually wiser than hand-rolling the index.

## Evaluation

Do not adopt visual retrieval on the strength of a demo, and do not reject it on the strength of one either. Measure it, against your text pipeline, on your corpus.

The harness is the same one [the chunking essay](/blog/chunking-not-rag/) argues you build before touching anything else: a golden set of questions whose right answers — and the specific page each answer lives on — are known. Build it once, in a JSON file in the repo, and every retrieval decision after it gets cheaper. For a visual-vs-text comparison, the golden set has one extra requirement: it must be deliberately stratified by question type, because the whole point is that the two systems fail in different places.

| Question type                       | What it probes                          |
| ----------------------------------- | --------------------------------------- |
| Plain-prose lookup                  | Where text RAG should already be strong |
| Table-cell lookup (row × column)    | Where OCR flattening does its damage    |
| Chart / figure question             | Where OCR typically drops the content   |
| Multi-column-page question          | Where layout reconstruction scrambles   |
| Scanned / low-quality-page question | Where OCR error rate spikes             |
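One possible shape for that golden set, as a sketch: the field names, stratum labels, document ids, questions, and page numbers below are all hypothetical, and the JSON is inlined as a string only so the example is self-contained. In practice it would live in a file in the repo.

```python
import json
from collections import Counter

# Hypothetical stratum labels, one per row of the table above.
STRATA = {"prose", "table_cell", "chart_figure", "multi_column", "scanned"}
REQUIRED = {"id", "stratum", "question", "doc_id", "page"}

# Invented entries; real ones come from people who know the corpus.
GOLDEN_JSON = """
[
  {"id": "q1", "stratum": "prose", "doc_id": "acme-10k-2024", "page": 14,
   "question": "How does the company describe its main competitors?"},
  {"id": "q2", "stratum": "table_cell", "doc_id": "acme-10k-2024", "page": 88,
   "question": "What was net revenue in FY2024?"},
  {"id": "q3", "stratum": "chart_figure", "doc_id": "acme-10k-2024", "page": 31,
   "question": "Which segment grew fastest in the revenue chart?"},
  {"id": "q4", "stratum": "multi_column", "doc_id": "acme-10k-2024", "page": 22,
   "question": "What risk is tied to supplier concentration?"},
  {"id": "q5", "stratum": "scanned", "doc_id": "lease-1998-scan", "page": 3,
   "question": "What is the renewal notice period?"}
]
"""


def load_golden(raw):
    """Parse the golden set and reject entries with missing fields or unknown strata."""
    items = json.loads(raw)
    for item in items:
        missing = REQUIRED - item.keys()
        if missing:
            raise ValueError(f"{item.get('id', '?')}: missing {sorted(missing)}")
        if item["stratum"] not in STRATA:
            raise ValueError(f"{item['id']}: unknown stratum {item['stratum']!r}")
    return items


def stratum_counts(items):
    """Questions per stratum, including zeros, so coverage gaps are visible."""
    counts = Counter(item["stratum"] for item in items)
    return {s: counts.get(s, 0) for s in sorted(STRATA)}


golden = load_golden(GOLDEN_JSON)
print(stratum_counts(golden))
```

Reporting zeros explicitly matters: a stratum with no questions cannot fail, which is a quiet way for a golden set to flatter one pipeline.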

Run the identical question set through both pipelines and report **precision@k per stratum**, not one blended number. A single average is exactly the trap [the chunking essay](/blog/chunking-not-rag/) warns about — it lets a strong prose score paper over a collapsed table score, or the reverse. The per-stratum split is the finding. The expected shape, for a corpus of this kind: the two systems land close on plain-prose lookups, and the visual pipeline pulls clearly ahead on the table, chart, multi-column, and scanned strata. If that pattern does not show up on _your_ corpus, your corpus is more text-shaped than you assumed — itself a useful, money-saving result.
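A minimal sketch of the per-stratum report, assuming each evaluated question carries a stratum label, the ranked page ids a pipeline returned, and the set of page ids that count as correct. The function names and every result below are invented for illustration; they are not measurements of any system.

```python
from collections import defaultdict


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k


def per_stratum(results, k):
    """Mean precision@k per stratum. `results` holds (stratum, retrieved, relevant) tuples."""
    buckets = defaultdict(list)
    for stratum, retrieved, relevant in results:
        buckets[stratum].append(precision_at_k(retrieved, relevant, k))
    return {s: sum(v) / len(v) for s, v in buckets.items()}


def blended(results, k):
    """The single average the text warns against, for comparison."""
    scores = [precision_at_k(ret, rel, k) for _, ret, rel in results]
    return sum(scores) / len(scores)


# Made-up output of a hypothetical text pipeline on four questions.
text_results = [
    ("prose", ["p14", "p15"], {"p14"}),
    ("prose", ["p3", "p9"], {"p3", "p9"}),
    ("table_cell", ["p40", "p41"], {"p88"}),
    ("table_cell", ["p88", "p12"], {"p88"}),
]

print("per stratum:", per_stratum(text_results, k=2))
print("blended:    ", blended(text_results, k=2))
```

On this toy data the blended number sits between a healthy prose score and a weak table score and describes neither. One caveat worth stating in the report: when a question has exactly one correct page, precision@k cannot exceed 1/k, so compare pipelines at the same k.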

Watch indexing cost and query latency on the same dashboard. The accuracy gain on layout-heavy strata is real; so is the GPU bill and the multi-vector index. A decision that ignores either half is not a decision.

## The hybrid recommendation

The framing that produces the wrong answer is "ColPali or text RAG, pick one for the whole corpus." Almost no real corpus is uniform. A document store is a mix of clean prose, dense filings, scanned contracts, and the occasional form, and the right retriever differs across that mix.

So route by document class — the same move [the chunking essay](/blog/chunking-not-rag/) makes for chunkers, applied one level up. Classify each document on the way in (file path, source system, a fast layout heuristic, a lightweight classifier — cheap signals are fine). Send layout-heavy classes to the visual index. Send clean-text classes to the text index. Query both and merge.

```
                  incoming document
                          |
                  classify by layout
                  /                  \
        layout-heavy                clean text
   (filings, forms, scans,        (code, API docs,
    papers w/ figures)             plain prose)
          |                              |
   ColPali-class visual           text embedder
   (multi-vector index)           (single-vector index)
          \                              /
                    query both, merge
```
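The routing step in the diagram can be as plain as a few ordered rules. This is a sketch only: the `Doc` fields, the source-system names, the file suffixes, the `non_prose_area` signal (a pretend estimate of how much of a page is tables or figures), and the threshold are all hypothetical stand-ins for whatever cheap signals your ingestion path already has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    path: str
    source: str            # hypothetical name of the upstream system
    non_prose_area: float  # hypothetical 0..1 share of page area that is tables/figures


# Assumed, illustrative rule tables; tune these against the eval harness.
TEXT_SUFFIXES = (".md", ".py", ".txt", ".rst")
LAYOUT_HEAVY_SOURCES = {"edgar", "scanner", "claims"}


def route(doc, threshold=0.2):
    """Return 'text' or 'visual' using the cheapest decisive signal first."""
    if doc.path.lower().endswith(TEXT_SUFFIXES):
        return "text"      # file type already says it is clean text
    if doc.source in LAYOUT_HEAVY_SOURCES:
        return "visual"    # source system is known to emit layout-heavy docs
    # Fall back to the layout heuristic for everything else.
    return "visual" if doc.non_prose_area >= threshold else "text"


docs = [
    Doc("src/auth.py", "git", 0.0),
    Doc("filings/acme-10k.pdf", "edgar", 0.1),
    Doc("misc/report.pdf", "upload", 0.35),
    Doc("misc/memo.pdf", "upload", 0.05),
]
for d in docs:
    print(d.path, "->", route(d))
```

Ordering the rules from cheapest to most expensive keeps the heuristic (or a classifier, if you replace it) off the hot path for documents whose class is already obvious.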

This costs you a classifier and two indexes. It buys you the visual retriever's accuracy where layout is load-bearing and the text retriever's speed everywhere else — instead of overpaying for image embeddings on your code, or losing every table in your filings. The decision is per document class, not per system, and it is made with the eval harness, not with intuition.
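The "query both, merge" step needs a rule for combining two ranked lists whose scores are not on the same scale: a MaxSim sum and a cosine similarity cannot be compared directly. One common option, and it is our substitution rather than anything the hybrid design prescribes, is reciprocal rank fusion, which uses only rank positions. The result ids below are invented, and `k=60` is a conventional default, not a tuned value.

```python
def rrf_merge(rankings, k=60, top_n=5):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Ties are broken by item id so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Made-up top results from each index for one query.
visual_hits = ["10k:p88", "10k:p40"]
text_hits = ["api:auth", "readme:intro"]

merged = rrf_merge([visual_hits, text_hits])
print(merged)
```

Because each document lives in only one index under routing, the two lists never overlap, and fusion reduces to interleaving by rank. If that treats a weak index too generously for your queries, that is a finding for the eval harness to surface, not something to settle by intuition.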

A short decision path for a team weighing this:

1. **Is the corpus mostly clean, born-digital text — code, docs, plain prose?** Yes → text RAG; visual retrieval buys nothing. No → continue.
2. **Does answering questions depend on tables, charts, forms, or scanned pages?** Yes, across the whole corpus → visual retrieval is a genuine candidate. Only for some document classes, or no → continue.
3. **Is the corpus a mix of both?** Yes → hybrid: classify, route, run both indexes. This is the common case.
4. **Have you priced the multi-vector index, the GPU encoder, and the re-index throughput?** No → do that before committing; index size is the line item that surprises teams.
5. **Have you measured both pipelines on a per-stratum golden set?** No → build it first. Without it, "visual retrieval is better" is a vibe, not a finding.

The OCR step was never a neutral preprocessing detail. It is a lossy transform you inserted into the most layout-rich part of your corpus and then forgot was there. Sometimes that loss is acceptable. Often, on the documents that actually matter, it is the whole problem — and the fix is to stop transcribing the page and start retrieving on it.

## Reading list

- [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) — the source paper. Read the late-interaction section closely; the multi-vector scoring is the whole idea.
- [Mixpeek](https://mixpeek.com/) — production tooling for multimodal and multi-vector retrieval pipelines.
- [The ColPali repository](https://github.com/illuin-tech/colpali) — the official ColVision code: training and inference for ColPali, ColQwen2, and ColSmol.
- [The chunking essay](/blog/chunking-not-rag/) — for the golden-set harness and the route-by-document-class pattern this post builds on.

Stop transcribing the page. Retrieve on it.