PII redaction that does not wreck retrieval.

A healthcare analytics team builds RAG over a corpus of clinical and operational documents. Legal is clear: personal data cannot reach the embedding model or the LLM. So the team does the obvious thing — they run a detector over every document and replace every name, date, identifier, and address with [REDACTED] before anything gets embedded. Compliance signs off. The system goes live.

Retrieval quality falls off a cliff. Questions that should land obvious documents return noise. A query about a specific patient’s medication history retrieves three unrelated discharge summaries. The team blames the embedder, swaps it, blames the retriever, adds a reranker. Nothing helps, because nothing they are touching is the problem. The problem is that they turned a sentence reading “Dr. Alvarez adjusted the patient’s metformin dose on March 3 after the A1c result” into “[REDACTED] adjusted the patient’s metformin dose on [REDACTED] after the [REDACTED] result” and then asked an embedding model to make that mean something.

Redaction was necessary. The way they redacted wrecked retrieval. This post is about the difference — how to strip PII from a corpus without destroying the signal the retriever runs on.

Redaction and retrieval pull in opposite directions

The tension is structural, not a tooling bug. Three forces, all working against you at once.

Embeddings encode the spans you are about to remove. An embedding model produces its vector from the whole input. The named entities (people, places, dates, identifiers) are not noise the model ignores; they are signal it encodes. Two documents about different patients with otherwise similar text get different embeddings largely because of the entities. Strip the entities and you have deliberately removed a chunk of what made each document findable.

An empty placeholder carries no signal at all. [REDACTED] is one constant string that means “something was here.” It is the same string whether it replaced a person, a hospital, a drug, or a date. Embed a sentence dense with [REDACTED] and the model sees a run of identical blanks where the distinguishing content used to be. You did not just remove information; you replaced it with a marker that actively pulls unrelated documents together, because every over-redacted document now shares the same blanks.

Over-redaction strips the context around the PII, not just the PII. A detector tuned to miss nothing flags aggressively. It redacts the job title next to the name, the department next to the person, the diagnosis adjacent to the date. Each over-flag removes a word that was not sensitive and was load-bearing for retrieval. The corpus gets safer and blanker in the same motion.

Put the three together and naive redaction is not a privacy tax on retrieval. It is a direct attack on the exact features retrieval depends on. The fix is not “redact less” — under-redacting leaks PII, which is the failure you cannot accept. The fix is to redact in a way that removes the identity and keeps the signal.

The detectors, and how each one is wrong

Before fixing the placeholder, look at what does the detecting — because every detector has an error profile, and the errors land on retrieval and on privacy differently.

Approach	How it works	False positives	False negatives
Regex / pattern	Fixed patterns — SSNs, card numbers, emails	Low on structured PII	Misses anything unstructured (names)
NER-based	A trained model tags entity spans	Moderate; context-sensitive	Misses rare names, novel formats
LLM-based	Prompt a model to find and redact PII	Variable; prompt-dependent	Misses under distribution shift; non-deterministic

Read the two error columns as two different costs. A false positive — redacting something that was not PII — strips signal: the system gets blanker and retrieval degrades. A false negative — missing real PII — leaks: the system gets unsafe. They are not symmetric. In a privacy deployment a false negative is the unacceptable one, so detectors get tuned toward recall, which means more false positives, which means more signal stripped. That is the trap: the safe tuning is the one that hurts retrieval most.

No single detector is enough. Regex is precise on structured identifiers and blind to names. NER catches names and misses the unusual ones. LLM-based detection is flexible and non-deterministic: it is not guaranteed to redact the same document the same way twice, which makes consistent pseudonyms and repeatable leakage measurements harder to guarantee. The practical answer is layered: regex for the structured, well-shaped identifiers where it is near-perfect, and an NER model for the unstructured entities it cannot pattern-match. Microsoft’s open-source Presidio is the common toolkit for exactly this shape. It combines pattern recognizers, NER (via spaCy, Stanza, or Hugging Face transformer models), and a context-aware enhancer that raises confidence on entities near supporting words. Layering does not eliminate the recall-versus-precision tension; it lets you apply the precise tool where it is precise and lean on the model only where you have no choice.
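The layering can be sketched with the standard library alone. This is a minimal illustration, not Presidio's API: the `Span` type, the two patterns, and the `KNOWN_NAMES` lookup are all assumptions made for the example, and the lookup is only a stand-in for a real NER model.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    source: str  # which layer produced the span


# Layer 1: regex for well-shaped identifiers (illustrative, not exhaustive).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

# Layer 2: hypothetical stand-in for an NER model; a real pipeline
# would call a trained tagger here instead of a fixed name list.
KNOWN_NAMES = ("Dr. Alvarez", "Dr. Brooks")


def regex_detect(text):
    return [Span(m.start(), m.end(), label, "regex")
            for label, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]


def ner_detect(text):
    return [Span(m.start(), m.end(), "PERSON", "ner")
            for name in KNOWN_NAMES
            for m in re.finditer(re.escape(name), text)]


def detect(text):
    """Merge both layers; where spans overlap, the regex span wins."""
    candidates = sorted(regex_detect(text) + ner_detect(text),
                        key=lambda s: (s.source != "regex", s.start))
    kept = []
    for s in candidates:
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


text = "Dr. Alvarez (alvarez@example.org) reviewed SSN 123-45-6789."
spans = detect(text)
for s in spans:
    print(s.label, s.source, repr(text[s.start:s.end]))
```

Letting the regex layer win on overlap is one possible policy: it trusts the precise tool where it applies. A recall-first deployment might instead keep the union of overlapping spans.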

Preserve the signal: typed placeholders

Here is the single highest-leverage change, and it costs almost nothing. Stop redacting to [REDACTED]. Redact to a typed placeholder that names the entity category.

  naive:   [REDACTED] adjusted the patient's metformin dose on [REDACTED]
  typed:   [PERSON] adjusted the patient's metformin dose on [DATE]

The identity is just as gone in the second line — no name, no date, nothing that re-identifies anyone. But [PERSON] and [DATE] are different tokens, and they are tokens the embedding model has seen in meaningful contexts during training. The slot keeps its grammatical and semantic role: the embedding still encodes “a person did something to a medication on a date,” which is most of what made the sentence retrievable. [REDACTED] throws that away; [PERSON] keeps it. The privacy guarantee is identical — you are choosing between two redactions that hide the same thing, and one of them happens to leave the sentence interpretable.
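A small sketch makes the difference concrete. It assumes detection has already produced `(start, end, label)` spans; `span_of` is a hypothetical helper that fakes detector output, and “Mercy General” and the `FACILITY` label are invented for the example.

```python
def span_of(text, needle, label):
    """Locate one known entity; a stand-in for real detector output."""
    start = text.index(needle)
    return (start, start + len(needle), label)


def redact(text, spans, typed=True):
    """Replace spans right to left so earlier offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        tag = f"[{label}]" if typed else "[REDACTED]"
        text = text[:start] + tag + text[end:]
    return text


sentence = "Dr. Alvarez adjusted the patient's metformin dose on March 3"
spans = [span_of(sentence, "Dr. Alvarez", "PERSON"),
         span_of(sentence, "March 3", "DATE")]

naive_out = redact(sentence, spans, typed=False)
typed_out = redact(sentence, spans)
print(naive_out)
print(typed_out)

# Two sentences that differ only in entity type collapse to the same
# string under naive redaction and stay distinct under typed redaction.
a = "Transferred to Dr. Alvarez"
b = "Transferred to Mercy General"
spans_a = [span_of(a, "Dr. Alvarez", "PERSON")]
spans_b = [span_of(b, "Mercy General", "FACILITY")]
print(redact(a, spans_a, typed=False) == redact(b, spans_b, typed=False))
print(redact(a, spans_a) == redact(b, spans_b))
```

The last two lines are the collapse described earlier in miniature: the opaque blank makes a referral to a person and a transfer to a facility textually identical before the embedder ever sees them.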

This is not a fringe trick. The PII-redaction literature has converged on it: recent benchmark work such as PRvL, a 2025 study of LLM-based PII redaction, documents that placeholder choice measurably affects downstream quality, and structured, type-aware placeholders consistently outperform opaque blanks. Presidio’s anonymizer supports this directly — its replace operator substitutes a detected entity with a type tag rather than a generic mask. Typed placeholders are the default a competent redaction pipeline should start from, not an optimization to reach for later.

Preserve co-reference: consistent pseudonymization

Typed placeholders fix the per-span signal. They do not fix a second, subtler loss — co-reference. Consider a document that mentions the same person five times. Replace every mention with [PERSON] and the model can no longer tell that the five slots are the same person. A question like “what did the prescribing physician do after the second visit” needs the system to track one entity across the document, and a corpus of undifferentiated [PERSON] tokens has erased the thread.

The fix is consistent pseudonymization: map each distinct entity to a stable, non-identifying token, and reuse that token everywhere the entity appears.

  source:   Dr. Alvarez ... Alvarez later ... she ...
  redacted: [PERSON_1] ... [PERSON_1] later ... [PERSON_1] ...
            (and Dr. Brooks, elsewhere in the doc, becomes [PERSON_2])

Now [PERSON_1] is consistent within the document: the co-reference chain survives, the retriever and the LLM can both still reason about “the same physician across the visit,” and no real name is anywhere in the text. Mapping the pronoun “she” to [PERSON_1] takes co-reference resolution on top of entity detection; matching repeated surface names alone will not catch it. Presidio can support the mapping through a custom anonymizer operator that keeps an entity-to-alias table for the document; the pattern is usually called pseudonymization rather than redaction precisely because it substitutes a consistent alias instead of a blank.
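Here is one way the per-document map could look, as a sketch under stated assumptions. `DocPseudonymizer` and `pseudonymize` are hypothetical names, the title-stripping normalization is deliberately crude, and pronouns are not resolved at all. The spans come from a toy regex standing in for a real detector.

```python
import re


class DocPseudonymizer:
    """Assigns [LABEL_n] aliases that are stable within one document only."""

    def __init__(self):
        self._aliases = {}  # (label, normalized surface form) -> alias
        self._counts = {}   # label -> distinct entities seen so far

    @staticmethod
    def _normalize(surface):
        # Crude assumption: drop a leading title and case so that
        # "Dr. Alvarez" and "Alvarez" resolve to the same entity.
        return re.sub(r"^(dr|mr|ms|mrs)\.?\s+", "", surface.strip().lower())

    def alias(self, label, surface):
        key = (label, self._normalize(surface))
        if key not in self._aliases:
            self._counts[label] = self._counts.get(label, 0) + 1
            self._aliases[key] = f"[{label}_{self._counts[label]}]"
        return self._aliases[key]


def pseudonymize(text, spans):
    """Use a fresh map per call, so aliases never link across documents."""
    mapper = DocPseudonymizer()
    ordered = sorted(spans)
    # Assign aliases in reading order so numbering follows first mention.
    aliases = [mapper.alias(label, text[s:e]) for s, e, label in ordered]
    # Substitute right to left so earlier offsets stay valid.
    for (s, e, _), alias in reversed(list(zip(ordered, aliases))):
        text = text[:s] + alias + text[e:]
    return text


NAME = re.compile(r"(?:Dr\. )?(?:Alvarez|Brooks)")  # toy detector


def person_spans(text):
    return [(m.start(), m.end(), "PERSON") for m in NAME.finditer(text)]


doc = "Dr. Alvarez saw the patient. Alvarez later consulted Dr. Brooks."
other = "Dr. Brooks reviewed the chart."
print(pseudonymize(doc, person_spans(doc)))
print(pseudonymize(other, person_spans(other)))
```

Because the map is created inside `pseudonymize`, Dr. Brooks is `[PERSON_2]` in the first document and `[PERSON_1]` in the second. Nothing links the two appearances.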

One scoping decision matters and is easy to get wrong. The pseudonym map should be consistent within a document, and you should think hard before making it consistent across the corpus. A corpus-wide stable mapping — the same person is [PERSON_1] in every document — makes cross-document reasoning possible, and it also rebuilds a re-identification surface: a persistent pseudonym that links a person’s appearances across hundreds of documents is exactly the linkage an attacker wants. Per-document consistency keeps the co-reference benefit where retrieval needs it most and denies the cross-document linkage. If a use case genuinely needs corpus-wide consistency, that is a deliberate privacy decision to make with whoever owns the threat model — not a default to back into.

Redact before embed, not after

A frequent question: can we embed the raw documents and redact only what the LLM sees at generation time? The answer is no, and it follows directly from the point our Confidential RAG post makes about embedding inversion. If you embed the raw, un-redacted documents, the vector store now holds embeddings of text containing PII, and inversion attacks can reconstruct much of the original text from an embedding, so the vector store is a partially recoverable copy of the un-redacted corpus. Redacting later, at the generation step, does nothing for the data already sitting in the index.

So redaction sits before the embedding model. The corpus that gets embedded is the redacted corpus; the vector store only ever holds vectors of redacted text; the chunks retrieved at query time are already redacted before they reach the LLM. Redact-before-embed is the only ordering that actually keeps PII out of the parts of the system that persist and leak.

Where redaction sits in the ingest pipeline

Place the stage concretely. Redaction is a transform between parsing and embedding, and its position relative to chunking is a real decision.

  parse ──► detect + redact ──► chunk ──► embed ──► index
            (regex + NER,        │
             typed placeholders, │
             per-doc pseudonyms) │
                                 └─► run detection on the WHOLE document,
                                     not on chunks: context for NER and
                                     for co-reference spans document-wide

Run detection on the whole document, before chunking, not on chunks after. Two reasons. NER and context-aware detection are more accurate with more surrounding text — a name at a chunk boundary, shorn of the sentence that identified it, is a name the detector is likelier to miss. And co-reference resolution for consistent pseudonymization needs to see every mention of an entity; a person mentioned in chunk 1 and chunk 7 has to receive the same pseudonym, which is only possible if detection saw the document whole. Chunk the already-redacted document. Never embed a chunk that was redacted in isolation.
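The ordering argument can be shown in a few lines. This is a sketch only: the `NAME` regex stands in for a real detector, paragraph splitting stands in for a real chunker, and `fake_embed` is a placeholder that records what it was given.

```python
import re

NAME = re.compile(r"(?:Dr\. )?(Alvarez|Brooks)")  # hypothetical detector


def redact_document(text):
    """Detect over the given text and assign pseudonyms by first mention."""
    aliases = {}

    def substitute(match):
        surname = match.group(1)
        aliases.setdefault(surname, f"[PERSON_{len(aliases) + 1}]")
        return aliases[surname]

    return NAME.sub(substitute, text)


def chunk(text):
    """Toy chunker: one chunk per paragraph."""
    return [p for p in text.split("\n\n") if p.strip()]


def ingest(doc, embed):
    redacted = redact_document(doc)  # whole document, before chunking
    chunks = chunk(redacted)         # chunk the already-redacted text
    return [(c, embed(c)) for c in chunks]  # embedder sees redacted text only


seen = []


def fake_embed(text):
    seen.append(text)            # record exactly what reached the embedder
    return [float(len(text))]    # placeholder vector, not a real embedding


doc = ("Dr. Brooks admitted the patient.\n\n"
       "Dr. Alvarez adjusted the dose.\n\n"
       "Brooks signed the discharge.")

indexed = ingest(doc, fake_embed)
for text, _ in indexed:
    print(text)

# The wrong order, for contrast: chunk first, then redact each chunk alone.
per_chunk = [redact_document(c) for c in chunk(doc)]
print(per_chunk)
```

In the whole-document run, Brooks is `[PERSON_1]` in the first and third chunks and Alvarez is `[PERSON_2]`. In the chunk-first run every chunk restarts its numbering, so `[PERSON_1]` refers to Brooks in one chunk and Alvarez in the next, which is worse than no pseudonym at all.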

Measuring it: two numbers, not one

Redaction quality is two metrics, and a pipeline that reports one is hiding the other.

Retrieval quality on a redacted golden set. Build the evaluation set the chunking and faithfulness essays argue for — graded questions with known answer documents — and run it twice: once against an index built from un-redacted documents, once against the redacted index. The gap is the retrieval cost of your redaction strategy, in the metric you already trust (precision@k, recall@k). Naive [REDACTED] will show a large gap; typed placeholders plus per-document pseudonymization should show a small one. If you cannot quantify the gap, you cannot claim your redaction is retrieval-safe — you are guessing.
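Computing the gap is simple once both runs exist. The sketch below uses recall@k; the golden set and both result lists are invented numbers for illustration, not measurements.

```python
def recall_at_k(results, golden, k):
    """Mean fraction of each question's relevant docs found in its top k."""
    scores = []
    for qid, relevant in golden.items():
        top_k = results.get(qid, [])[:k]
        scores.append(len(set(top_k) & relevant) / len(relevant))
    return sum(scores) / len(scores)


# Hypothetical golden set: question id -> ids of documents that answer it.
golden = {"q1": {"d1"}, "q2": {"d4", "d5"}}

# Hypothetical ranked results from the two indexes.
raw_results = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d9"]}
redacted_results = {"q1": ["d2", "d1", "d3"], "q2": ["d4", "d8", "d9"]}

k = 2
raw_recall = recall_at_k(raw_results, golden, k)
redacted_recall = recall_at_k(redacted_results, golden, k)
gap = raw_recall - redacted_recall
print(raw_recall, redacted_recall, gap)
```

The same questions and the same `k` must be used for both runs; otherwise the difference measures the evaluation setup rather than the redaction strategy.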

PII leakage rate. Hold out a labelled set of documents with every PII span marked, run the pipeline, and measure how many real spans survived into the redacted output. This is the complement of your detector stack’s recall (its miss rate), stated as the number that actually matters: leaked entities per thousand labelled spans. It is the hard constraint: a leakage rate above your compliance threshold fails, no matter how good retrieval looks.
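A minimal version of the leakage measurement might look like this. It assumes the hold-out set is a list of `(text, pii_strings)` pairs and treats a span as leaked if its exact string survives in the output. That is a simplification: an offset-based comparison would be stricter, and partial leaks are not counted. `toy_redact` is a deliberately weak redactor so that something leaks.

```python
import re


def leakage_per_thousand(labelled_docs, redact):
    """Count labelled PII strings that survive redaction verbatim."""
    total = leaked = 0
    for text, pii_strings in labelled_docs:
        output = redact(text)
        for pii in pii_strings:
            total += 1
            if pii in output:
                leaked += 1
    return 1000 * leaked / total if total else 0.0


def toy_redact(text):
    # Catches SSN-shaped identifiers only; names pass straight through.
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)


labelled = [
    ("Dr. Alvarez filed SSN 123-45-6789.", ["Dr. Alvarez", "123-45-6789"]),
    ("Contact 987-65-4321 for records.", ["987-65-4321"]),
]

rate = leakage_per_thousand(labelled, toy_redact)
print(round(rate, 1))
```

One of the three labelled spans (the name) survives, so detector recall is two thirds and the leakage rate is the remaining third, scaled to per-thousand.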

The two numbers are in tension by construction, and that is the point of measuring both. Tuning detectors toward higher recall drives the leakage rate down and the retrieval gap up. The job is not to optimize either in isolation. It is to drive the leakage rate under the compliance bar and then recover as much retrieval quality as possible within that constraint, mostly by applying the typed-placeholder and pseudonymization techniques above rather than by under-detecting. A single “redaction accuracy” number lets a team dodge that trade. Two numbers force it into the open.
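Stated as a selection rule, leakage is a hard filter and the retrieval gap is what gets minimized among the survivors. The sketch below assumes each candidate pipeline has already been measured; the configuration names and all numbers are made up for illustration.

```python
def pick_config(candidates, max_leak_per_thousand):
    """Among configs under the leakage bar, pick the smallest retrieval gap."""
    compliant = [c for c in candidates
                 if c["leak_per_thousand"] <= max_leak_per_thousand]
    if not compliant:
        return None  # nothing is safe to ship; retrieval quality is moot
    return min(compliant, key=lambda c: c["retrieval_gap"])


# Hypothetical measurements for three pipeline variants.
candidates = [
    {"name": "regex only",
     "leak_per_thousand": 40.0, "retrieval_gap": 0.02},
    {"name": "regex + NER, [REDACTED]",
     "leak_per_thousand": 3.0, "retrieval_gap": 0.31},
    {"name": "regex + NER, typed + pseudonyms",
     "leak_per_thousand": 3.0, "retrieval_gap": 0.06},
]

best = pick_config(candidates, max_leak_per_thousand=5.0)
print(best["name"])
```

The regex-only variant has the best retrieval number and is never considered, because it fails the constraint first. That ordering of the two checks is the whole argument for reporting both numbers.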

The checklist

Before a redacted corpus reaches the embedding model:

Redaction runs before embedding — the vector store only ever holds vectors of redacted text.
Detection is layered — regex for structured identifiers, an NER model for unstructured entities; neither alone.
Detection runs on the whole document, before chunking, so context and co-reference survive.
PII is replaced with typed placeholders ([PERSON], [DATE]), never an opaque [REDACTED].
The same entity maps to a consistent pseudonym within a document; corpus-wide consistency is a deliberate decision, not a default.
Retrieval quality is measured on a redacted-vs-unredacted golden set, and the gap is a known number.
PII leakage rate is measured on a labelled hold-out set, and it sits under the compliance threshold.

Seven lines, and the through-line of all of them is the same: redaction has to remove identity without removing signal, and you only know whether it did by measuring both at once.

Reading list

Microsoft’s Presidio — the open-source detection and anonymization toolkit: pattern recognizers, NER, context-aware enhancement, and the replace operator behind typed placeholders.
PRvL: Quantifying the Capabilities and Risks of Large Language Models for PII Redaction — a 2025 benchmark of LLM-based redaction, including how placeholder choice moves downstream quality.
Our own Confidential RAG — why redaction is only half of corpus privacy, and what protects the sensitive data redaction leaves behind.
You don’t have a RAG problem, you have a chunking problem — the golden-set discipline you need before you can measure a redaction strategy at all.

Redaction is necessary and naive redaction is self-defeating. Strip the identity, keep the signal — typed placeholders, per-document pseudonyms, and two numbers that keep you honest.
