# Confidential RAG: keep the context secret, not just the query

A law firm wants a RAG system over its matter files — contracts, deposition transcripts, privileged memos. The vendor's pitch leads with privacy: the user's question is sent over TLS, never logged, never used for training. The firm's general counsel reads that, nods, and signs. The system goes live.

Here is what the contract did not say. Before it can answer anything, the system embeds every document in the corpus and stores those embeddings in a managed vector database; at query time it hands the top retrieved chunks to a hosted LLM. The query was protected. The corpus, the actual privileged material, was embedded by a model on someone else's GPU, sits in a vector store on someone else's infrastructure, and is shipped chunk by chunk into a model the firm does not operate. The sensitive asset was never the question. It was the documents, and the documents spent the whole pipeline in plaintext on hardware the firm does not control.

This is the common shape of "private RAG," and it protects the wrong thing. This post is about the other design — confidential RAG — where the corpus is treated as the asset it is, and the embeddings, the vector store, and the retrieved chunks are all in scope.

## What "private RAG" usually means

Strip the marketing and most "private RAG" offerings reduce to three guarantees, all about the query:

- **The query is encrypted in transit.** TLS between the user and the service. This is table stakes for any web service and says nothing specific about RAG.
- **The query is not retained.** It is not logged, not stored, not added to a training set. A policy commitment, worth having, and not a technical control.
- **The query is not used to train models.** Again a policy, enforced by contract and trust, not by architecture.

Every one of those is about the user's input. Not one is about the corpus. A vendor can honor all three with complete sincerity and still embed your documents on shared infrastructure, store the vectors in a multi-tenant database, and forward your retrieved chunks to a third-party model API. "Private" described the transport and the retention policy for one short string. It never described the documents — and in a RAG system over confidential material, the documents are the entire reason privacy is on the table.

The query is a few words a user typed. The corpus is the firm's privileged files, the hospital's patient records, the company's unreleased strategy. Protecting the first and exposing the second is not a privacy design for the data that matters. It is a privacy design pointed at the cheap asset.

## The corpus is the asset — and it leaks three ways

Be precise about where the corpus is exposed. A RAG pipeline has three places the documents exist outside the source system, and each is a distinct leak.

```
  SOURCE                    INGEST                    STORE                    SERVE
  ┌───────────┐  plaintext  ┌─────────────┐  vectors  ┌─────────────┐  chunks  ┌───────────┐
  │ documents │ ──────────► │ embedding   │ ────────► │ vector      │ ───────► │ LLM       │
  │ (secret)  │             │ model       │           │ database    │          │ provider  │
  └───────────┘             └─────────────┘           └─────────────┘          └───────────┘
                                   │                         │                       │
                            sees plaintext             holds vectors          sees retrieved
                             of every doc              (invertible)           chunks verbatim
                                   ▼                         ▼                       ▼
                                LEAK 1                    LEAK 2                  LEAK 3
```

**Leak one — the embedding model sees plaintext.** To turn a document into a vector, something has to read the document. If the embedding model is a hosted API, every chunk of every document is transmitted, in plaintext, to that provider. People reason about the LLM call as the sensitive step and forget that ingestion already shipped the entire corpus to an embedding endpoint, often months before the first query.

**Leak two — the vector database is a copy of your data.** This is the leak teams most underrate, because a vector feels like a hash — a lossy, scrambled, one-way digest. It is not. Embeddings are invertible. Morris et al., in the 2023 paper [_Text Embeddings Reveal (Almost) As Much As Text_](https://arxiv.org/abs/2310.06816), built `vec2text`, a method that reconstructs input text from its embedding by iteratively generating a guess, re-embedding it, and correcting toward the target vector. On short inputs the reconstruction is near-verbatim — they report recovering 92% of 32-token text inputs exactly, and demonstrate pulling full patient names out of embedded clinical notes. The attack weakens on longer text, but the direction is settled: a dense embedding is not a one-way function, and a vector database is a recoverable copy of your corpus wearing a numeric disguise. Whoever can read that database can, with enough effort, read your documents.
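
The guess, re-embed, correct loop is easier to see in a toy. The sketch below is not `vec2text`: it assumes a made-up additive bag-of-words embedder (`toy_embed`) and a small known vocabulary, both hypothetical, and greedily adds whichever word moves the re-embedded guess closest to the stored vector. Real dense embeddings are much harder to invert than this, but the shape of the attack is the same.

```python
import hashlib
import math

DIM = 256  # one dimension per bit of a SHA-256 digest


def word_vector(word: str) -> list[float]:
    """Deterministic pseudo-random +/-1 vector (toy stand-in for a real model)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [1.0 if (digest[i // 8] >> (i % 8)) & 1 else -1.0 for i in range(DIM)]


def toy_embed(words: list[str]) -> list[float]:
    """Hypothetical embedder: the sum of the word vectors."""
    total = [0.0] * DIM
    for word in words:
        for i, value in enumerate(word_vector(word)):
            total[i] += value
    return total


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def invert(target: list[float], vocabulary: list[str], max_words: int = 10) -> list[str]:
    """Greedy inversion: guess a word, re-embed, keep it only if we got closer."""
    guess: list[str] = []
    best = 0.0
    while len(guess) < max_words:
        candidates = [w for w in vocabulary if w not in guess]
        if not candidates:
            break
        score, word = max((cosine(toy_embed(guess + [w]), target), w) for w in candidates)
        if score <= best + 1e-9:  # no candidate improves the match: stop
            break
        guess.append(word)
        best = score
    return guess


# The attacker sees only the stored vector, never the text.
secret = ["patient", "john", "smith", "diagnosed", "diabetes"]
stored_vector = toy_embed(secret)
vocabulary = secret + ["contract", "merger", "invoice", "deposition", "weather",
                       "alice", "jones", "asthma", "settlement", "quarterly"]
recovered = invert(stored_vector, vocabulary)
```

The attacker here needs nothing but read access to the vector and the ability to call the same embedder, which is the same position as anyone who can read a vector database built with a known model.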

**Leak three — the LLM provider sees the retrieved chunks.** This is the leak even careful teams accept by default. At query time the pipeline retrieves the most relevant chunks and places them in the model's context window. If the model is a hosted API, those chunks — the most sensitive, most on-point passages in the entire corpus, selected precisely because they answer a real question — are transmitted verbatim to the provider. Retrieval does not send random text. It sends the paragraph that matters most, every time.

Three leaks, three different parties, and a "private RAG" pitch that addressed none of them because it was busy encrypting the query.

## A threat model — who sees what

"Make RAG private" is not a task until you name the adversary. The honest exercise is a table: every component, what plaintext it touches, and whether you operate it.

| Component       | Plaintext it sees                          | Operated by            | If compromised                           |
| --------------- | ------------------------------------------ | ---------------------- | ---------------------------------------- |
| Embedding model | Every chunk, at ingest time                | Often a vendor         | Whole corpus leaks, retroactively        |
| Vector database | Embeddings — invertible to text            | Often a vendor         | Corpus recoverable via inversion         |
| LLM provider    | Retrieved chunks, verbatim, per query      | Almost always a vendor | The most sensitive passages leak         |
| Infra operator  | Anything in RAM, disk, network on the host | Cloud provider         | Everything not inside a hardware enclave |

Read down the "operated by" column. In the default managed-RAG stack, every row is somebody else. The firm that signed the contract operates none of the four components that touch its privileged documents in plaintext. Confidential RAG is the work of changing that column — moving each component either onto infrastructure the firm controls, or inside an environment where the operator provably cannot see in.
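
The table can also be kept as data and checked mechanically. The sketch below is one possible encoding, not a standard: the `Component` fields and the `"firm"` label are assumptions made for illustration. It flags every component that touches plaintext while being neither operated by the firm nor inside an attested enclave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    plaintext_seen: str
    operated_by: str                  # "firm" or the name of a third party
    inside_attested_tee: bool = False


def exposed(components: list[Component]) -> list[str]:
    """Names of components where a third party can read plaintext."""
    return [
        c.name
        for c in components
        if c.operated_by != "firm" and not c.inside_attested_tee
    ]


# The default managed-RAG stack from the table above.
DEFAULT_STACK = [
    Component("embedding model", "every chunk, at ingest time", "vendor"),
    Component("vector database", "embeddings (invertible)", "vendor"),
    Component("LLM provider", "retrieved chunks, per query", "vendor"),
    Component("infra operator", "RAM, disk, network on the host", "cloud provider"),
]

# The same rows after changing the "operated by" column.
CONFIDENTIAL_STACK = [
    Component("embedding model", "every chunk, at ingest time", "firm"),
    Component("vector database", "embeddings (invertible)", "firm"),
    Component("LLM provider", "retrieved chunks, per query", "vendor", inside_attested_tee=True),
    Component("infra operator", "enclave host only", "cloud provider", inside_attested_tee=True),
]

print(exposed(DEFAULT_STACK))
print(exposed(CONFIDENTIAL_STACK))
```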

Note what is _not_ this post's topic. Stripping personal data out of the documents before they ever enter the pipeline is a real and separate control — that is redaction, and it is its own engineering, covered in [PII redaction that does not wreck retrieval](/blog/pii-redaction-retrieval/). Redaction reduces how much sensitive data is in the corpus at all. Confidential RAG protects the sensitive data that remains. You want both; this post is the architecture half.

## Designs that protect the corpus

Each leak has a fix, and the fixes are independent — you can close one, two, or all three, at rising cost and rising assurance.

### Self-host the embedding model

The cheapest and highest-leverage move. Open-weight embedding models are strong enough for most retrieval workloads, and a self-hosted embedding model means the plaintext of your corpus never leaves your infrastructure at ingest time. Leak one closes outright. This is not exotic — it is a model server you run, on a GPU you rent or own, and it removes an entire vendor from the plaintext path. If you do one thing on this list, do this one.
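
One way to keep that property from eroding over time is to make the ingest code refuse any embedding endpoint the firm does not run. The sketch below is a minimal guard under assumptions: the hostnames in `FIRM_EMBEDDING_HOSTS` and the function name are hypothetical, and a real deployment would pair this with network-level egress rules.

```python
from urllib.parse import urlparse

# Hypothetical hostnames of model servers the firm operates itself.
FIRM_EMBEDDING_HOSTS = frozenset({"embed.internal.example", "localhost"})


class PlaintextEgressError(RuntimeError):
    """Raised when ingest would send document text to a host the firm does not run."""


def check_embedding_endpoint(url: str, allowed_hosts: frozenset = FIRM_EMBEDDING_HOSTS) -> str:
    """Return the URL unchanged if its host is on the allow-list, else raise."""
    host = (urlparse(url).hostname or "").lower()
    if host not in allowed_hosts:
        raise PlaintextEgressError(f"refusing to send corpus plaintext to {host!r}")
    return url


# Called once at ingest start-up, before any chunk is read.
endpoint = check_embedding_endpoint("https://embed.internal.example/v1/embed")
```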

### Self-host the vector store — and treat it as sensitive

Running your own vector database keeps the embeddings on infrastructure you control, which closes leak two against an _external_ vendor. But the invertibility result reframes what the vector store is: not an index, a derived copy of your corpus. So it inherits the corpus's classification. Encrypt it at rest. Put it behind the same access controls as the source documents. Do not replicate it to a region or an account your source data would never be allowed in. A team that locks down the document store and leaves the vector store wide open has locked the front door and left a full transcript by the window.
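
"Same access controls as the source documents" can be sketched as ACL inheritance: each vector record carries the groups allowed to read its source document, and retrieval filters on them before ranking. The record shape and group names below are assumptions for illustration, and the in-memory search stands in for whatever filtering your vector store actually supports.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorRecord:
    doc_id: str
    chunk: str
    vector: tuple[float, ...]
    allowed_groups: frozenset[str]  # copied from the source document's ACL at ingest


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query_vector, caller_groups, k=3):
    """Rank only the records the caller could already read in the source system."""
    visible = [r for r in records if r.allowed_groups & caller_groups]
    visible.sort(key=lambda r: cosine(r.vector, query_vector), reverse=True)
    return visible[:k]


records = [
    VectorRecord("matter-17/memo", "privileged memo text", (1.0, 0.0), frozenset({"matter-17"})),
    VectorRecord("matter-42/contract", "contract clause text", (0.9, 0.1), frozenset({"matter-42"})),
]

# A lawyer staffed only on matter 42 never sees matter 17, however similar the vectors.
hits = search(records, (1.0, 0.0), {"matter-42"})
```

Filtering before ranking matters: a filter applied after top-k selection can still reveal, through missing results, that a better match exists somewhere the caller cannot read.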

### Run inference where the provider cannot see in — TEEs

Leak three is the hard one, because the whole point of using a frontier model is that you did not train it and cannot host its weights. You need the model's quality and you cannot bring the model in-house. That is exactly the problem a Trusted Execution Environment solves: a hardware-isolated region of a CPU or GPU where code runs on memory the host cannot read (encrypted on CPU TEEs, access-isolated on some GPUs), so the infrastructure operator, meaning the cloud provider, the host OS, and other tenants, cannot inspect what is inside. Retrieved chunks are decrypted only inside the enclave, used for generation, and never visible to the host.

The performance objection that used to kill this idea no longer holds. A 2025 study from ETH Zurich, [_Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs_](https://arxiv.org/abs/2509.18886), benchmarked full LLM inference inside Intel's CPU TEEs (SGX, TDX) and NVIDIA H100 GPU confidential computing, and measured CPU-TEE overhead under 10% on throughput, with GPU-TEE overhead of roughly 4–8%. They also ran an entire RAG pipeline — an Elasticsearch retrieval step plus generation — inside a TEE and reported about 6–7% overhead. Confidential inference is not free, but a single-digit tax is a tax most confidential workloads can pay.
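
To turn a throughput overhead into a budget line, a back-of-the-envelope conversion helps. The sketch below assumes "overhead" means the fraction of throughput lost relative to a non-TEE baseline, and that cost scales with compute time per token; both are simplifying assumptions, not claims from the study.

```python
def cost_multiplier(throughput_overhead: float) -> float:
    """Relative compute time per token when throughput drops by the given fraction."""
    if not 0.0 <= throughput_overhead < 1.0:
        raise ValueError("overhead must be a fraction in [0, 1)")
    return 1.0 / (1.0 - throughput_overhead)


for overhead in (0.04, 0.07, 0.10):
    extra = (cost_multiplier(overhead) - 1.0) * 100
    print(f"{overhead:.0%} throughput overhead -> {extra:.1f}% more compute time per token")
```

Under those assumptions a 7% throughput loss works out to about 7.5% more compute time per token, so the cost penalty stays close to the headline overhead as long as the overhead itself is small.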

### The honest gap — FHE is not the answer here yet

Someone always asks why not Fully Homomorphic Encryption — compute on the encrypted chunks directly, so the provider never decrypts anything, and the trust shifts from a chip vendor to mathematics. The instinct is right; the timing is wrong. FHE inference on a transformer remains orders of magnitude slower than plaintext — far outside the latency budget of an interactive RAG system in 2026. We laid out that trade in full in [FHE vs TEE for ML](/blog/fhe-vs-tee-for-ml/): FHE buys a cleaner trust model and pays for it with a performance penalty that, for LLM-scale inference, is still prohibitive. For confidential RAG today, TEE is the practical instrument for leak three, and FHE is the one to watch — not the one to ship.

## A reference architecture

Assemble the pieces into a stack a firm could actually deploy.

```
  ┌─────────────────────── infrastructure the firm controls ──────────────────────┐
  │                                                                                │
  │   documents ──► self-hosted ──► encrypted vector ──► retriever ──► top-k        │
  │   (in firm's   embedding model   store (at rest,      (in firm's   chunks       │
  │    own store)  (open-weight,     same ACL as docs)    infra)       │           │
  │                 on firm's GPU)                                     │           │
  └────────────────────────────────────────────────────────────────────┼──────────┘
                                                                         │
                                            chunks travel encrypted ─────┤
                                                                         ▼
                              ┌──────────── TEE on provider hardware ───────────┐
                              │  attested enclave: decrypts chunks, runs the    │
                              │  frontier model, returns the answer. Operator   │
                              │  sees an enclave running — not what is inside.  │
                              └─────────────────────────────────────────────────┘
```

The flow: documents live in the firm's own store and are embedded by an open-weight model on the firm's own GPU — leak one closed. The vectors land in a self-hosted store, encrypted at rest, under the same access controls as the source documents — leak two closed against external parties, and the store correctly treated as a corpus copy. At query time the retriever runs on the firm's infrastructure; only the top-k chunks leave, encrypted, bound for an _attested_ TEE on the provider's hardware. The firm verifies, by attestation, the exact code the enclave is running before sending anything. The model generates inside the enclave; the operator sees a sealed box doing work, not the privileged paragraphs inside it — leak three closed.
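
The "verify before sending" step can be sketched as a gate in the client. Everything named below is hypothetical: real attestation reports are vendor-specific, and `signature_ok` stands in for verifying the hardware vendor's signature chain, which this sketch does not implement. What it does show is the ordering: nothing leaves the firm until the enclave's reported measurement matches a value the firm pinned in advance.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class AttestationReport:
    """Hypothetical, simplified report; real formats differ per TEE vendor."""
    measurement: bytes   # hash of the code and config loaded into the enclave
    signature_ok: bool   # stand-in for a verified hardware signature chain


class AttestationError(RuntimeError):
    pass


def release_chunks(
    report: AttestationReport,
    expected_measurement: bytes,
    encrypted_chunks: Sequence[bytes],
    send: Callable[[bytes], None],
) -> int:
    """Send chunks only if the enclave proves it runs the pinned build."""
    if not report.signature_ok:
        raise AttestationError("report is not signed by trusted hardware")
    # Constant-time comparison of the reported and pinned measurements.
    if not hmac.compare_digest(report.measurement, expected_measurement):
        raise AttestationError("enclave is running unexpected code")
    for chunk in encrypted_chunks:
        send(chunk)
    return len(encrypted_chunks)


# The firm pins the measurement of the enclave build it reviewed (placeholder value).
PINNED = hashlib.sha256(b"approved-enclave-build").digest()
```

The gate fails closed: a report that is unsigned or carries a different measurement raises before `send` is ever called, so a misconfigured or swapped enclave receives no chunks at all.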

What did the firm give up? It self-hosts two components it might have rented, and it accepts a single-digit percentage overhead on inference. What did it get? Every plaintext-touching component is either on infrastructure it controls or inside an enclave it has attested. The "operated by" column of the threat-model table no longer reads "somebody else" all the way down.

## Honest limits

Confidential RAG closes the three leaks. It does not make the system unconditionally private, and a design that oversells itself is worse than one that is clear about its edges.

- **TEEs rest on a hardware-trust assumption.** An enclave is only as sound as the chip vendor's implementation, and TEEs have a documented history of side-channel breaks followed by patches. You are trading "trust the cloud operator" for "trust the silicon vendor and their patch cadence" — a better trade for most threat models, and not a trust-free one. NVIDIA's H100 confidential computing, in particular, does not encrypt GPU memory the way a CPU TEE encrypts RAM; know your platform's exact boundary.
- **Embedding inversion is a moving target, on both sides.** Inversion is easier on short chunks and harder on long ones; defenses that perturb embeddings to resist reconstruction exist but trade away retrieval quality, and the research is active. Do not treat "embeddings are invertible" as a solved problem in either direction — treat the vector store as sensitive and move on.
- **Self-hosting moves the burden, it does not delete it.** A self-hosted embedding model and vector store are now _your_ systems to patch, monitor, and access-control. You removed a vendor from the trust model and added an operational obligation. That is usually the right trade for a confidential corpus; it is still a trade.
- **Metadata and query patterns leak even when content does not.** Which documents get retrieved, how often, in response to what cadence of queries — that traffic shape is observable at the infrastructure layer and can disclose plenty without a single chunk being read. Confidential RAG narrows the content leak; it does not erase the side channels around it.

None of these undo the design. They mark its edges, and a confidential-RAG deployment should be specified against them — not pitched as if they were not there. The firm that signed the first contract was not wrong to want privacy. It was sold a control aimed at the query. The corpus is the asset; point the controls at the corpus.

## Reading list

- Morris et al., [Text Embeddings Reveal (Almost) As Much As Text](https://arxiv.org/abs/2310.06816) — the paper that built `vec2text` and established that a vector store is a recoverable copy of your corpus.
- [Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs](https://arxiv.org/abs/2509.18886) — the 2025 ETH Zurich study with the single-digit overhead numbers for inference and for a full RAG pipeline inside a TEE.
- Our own [FHE vs TEE for ML](/blog/fhe-vs-tee-for-ml/) — why TEE is the practical instrument for confidential inference today and FHE is not, with the performance numbers behind that call.
- [You don't have a RAG problem, you have a chunking problem](/blog/chunking-not-rag/) — because a confidential RAG system still has to retrieve well, and a secret corpus chunked badly is a private system that gives wrong answers.

Private RAG that protects the query is protecting the cheap half. The documents are the asset. Encrypt the store, self-host the embedder, attest the enclave — and keep the context secret, not just the question.