# Folding schemes for zkML, explained without the cryptography.

A zero-knowledge proof of a small neural network is a few-million-constraint circuit that proves in seconds. A zero-knowledge proof of a real transformer is the same idea scaled until it stops working: the circuit grows with every operation the model performs, and a model that performs billions of operations produces a circuit nobody can prove in one piece. The prover runs out of memory, or out of patience, or out of both.

That is the wall zkML has been stuck behind, and our [five-zkML-libraries benchmark](/blog/which-zkml-ships/) put hard numbers on it — a 3.4M-parameter MLP is already at the edge of routine practicality, and a transformer is an order of magnitude worse per layer. The frontier moved in 2025: Lagrange's DeepProve produced [the first proof of a full LLM inference](/blog/deepprove-first-proven-llm/), GPT-2. But "GPT-2, slowly, once" is the frontier, not the routine case, and the reason is the monolithic-proof problem — proving the whole thing at once.

Folding schemes are the most promising lever against that problem. They come from a line of cryptography research — Nova, and a family of successors — and the cryptography is genuinely intricate. This post is not about the cryptography. It is about the _idea_, which is simple, intuitive, and worth understanding even if you never read a paper: prove a long, repetitive computation one step at a time, folding each step into a running summary instead of building one giant proof at the end.

## The scaling wall is a one-shot problem

Be precise about what breaks. A zkML proof works by compiling the model's computation into an arithmetic circuit — a vast system of equations that is satisfiable only if the model was evaluated honestly — and then proving that the system is satisfied. The cost of that proof scales with the size of the circuit, and the circuit's size scales with the number of operations: every multiply, every addition, every approximated nonlinearity is more equations.

For a small model this is fine. For a large one it is not, and the failure is not gradual. The prover has to hold the whole circuit in memory at once to prove it in one shot, and a frontier-scale model's circuit does not fit. You cannot rent your way around it with a bigger machine for an afternoon, because the cost is super-linear and the memory ceiling is real. The problem is not "proving is expensive." The problem is _proving the whole thing at once_ is expensive. Those are different sentences, and folding attacks the second one.

## ML inference is a loop

Here is the structural fact that makes folding fit. A neural network is not billions of distinct, unrelated operations. It is a small number of operations applied over and over.

A transformer is a stack of identical blocks — the same attention-and-feedforward pattern, repeated for every layer. Run the model and you execute that block, then execute it again, then again, dozens of times. Generate a sequence and there is a second loop on top: the whole stack runs once per output token, so a 100-token completion is the layer stack executed a hundred times over. Inference is loops inside loops inside loops. The computation is enormous, but it is enormous because it is _repetitive_, not because it is varied.

That distinction is the whole opportunity. A computation that is large because it does a thousand different things is genuinely hard to compress. A computation that is large because it does the _same thing_ a thousand times has structure — and structure is something a proof system can exploit. Loops are exactly what folding was built for.

## The idea of folding, without the math

Incrementally verifiable computation — IVC — is the goal folding serves, and the name says what it is. Instead of proving a long computation after it finishes, you prove it _incrementally_, as it runs, keeping a small proof up to date step by step.

The mechanism is an accumulator. Picture the computation as a sequence of identical steps — step 1, step 2, step 3, on to step N. At each step you have a claim: "this step was performed correctly." A folding scheme takes the claim for the new step and the running summary of all the steps before it, and _folds_ them together into a single new summary of the same fixed size. Fold, advance, fold, advance. After N steps you hold one small object that stands in for the correctness of all N steps.

Three properties follow, and they are the reason folding matters:

- **The per-step cost stays roughly flat.** Folding step N into the accumulator costs about the same as folding step 2. The work to extend the proof does not grow with how many steps you have already done. A million-step computation has the same per-step proving cost as a thousand-step one.
- **The memory stays bounded.** You never hold the whole computation at once. You hold one step and one accumulator. The monolithic circuit that did not fit in memory never has to exist; you process the computation as a stream.
- **There is no giant proof at the end.** Folding defers and compresses rather than accumulating a mountain of proof data. At the very end you produce one short final proof — of the accumulator, not of the whole computation — that a verifier checks quickly. The end-of-run artifact is small and constant-size, not a monument that grew with every step.

That last point deserves a caveat, because precision matters here: folding does not literally eliminate the final proving step. A long folding run still ends with a single succinct proof — practitioners call it the "decider" — that proves the accumulated summary is valid. The win is that this final proof is of fixed, small size regardless of how many steps were folded. You trade one impossible proof for N cheap folds plus one cheap proof. That is the trade, stated honestly.

## A plain-language analogy

Think about verifying a long recipe.

The monolithic approach: cook the entire meal, then, to prove you followed the recipe, re-derive the whole thing from scratch in front of an inspector — re-run every step, start to finish, in one sitting. For a three-step recipe, tolerable. For a recipe with ten thousand steps, the inspection is as much work as the cooking, and you need a kitchen big enough to lay out all ten thousand steps at once.

The folding approach: keep a running checksum. After each step you update a single short tally that summarizes "every step so far was done right." Step ten thousand updates the tally exactly as cheaply as step two did. At the end you hand the inspector the final tally and one short argument that the tally was maintained honestly. They check that, and they are done — they never re-walk the ten thousand steps, and you never needed a kitchen to hold them.

A running checksum over a stream of identical steps. That is folding. The model's layer-by-layer, token-by-token loop is the stream; the accumulator is the checksum; the final decider proof is the short argument that the checksum was kept honestly.

## The Nova lineage, at a high level

Folding schemes are not one construction but a short family tree, each member adding a capability. The non-cryptographic version of the lineage:

**Nova** (Kothapalli, Setty, and Tzialla, 2022) is the foundation. It introduced the folding scheme as a primitive and used it to build IVC: a way to prove a long sequence of identical steps where, in the authors' framing, the prover's work to extend the proof does not depend on how many steps have run, and the verifier's work does not grow with the number of steps either. Nova is the existence proof that the running-checksum idea works.

**SuperNova** (Kothapalli and Setty, 2022) relaxes the "identical steps" requirement. Real programs do not run one instruction over and over; they branch — different steps do different things. SuperNova lets the loop have a _menu_ of step types and, crucially, makes the cost of proving a given step proportional only to the step that was actually taken. You do not pay for the instructions you did not use. For a model that mixes layer types, that à-la-carte costing matters.

**HyperNova** (Kothapalli and Setty, 2023) widens the kind of step a fold can describe. It works over a more expressive constraint format (called CCS) that captures high-degree relations directly, and it folds them without the awkward extra "cross-terms" earlier schemes needed. In plain terms: HyperNova lets each step express richer logic per fold, which suits the non-linear operations — the softmaxes and normalizations — that make transformers hard.

**ProtoStar** (Bünz and Chen, 2023) is a general recipe for turning a broad class of protocols into folding schemes, and it brought efficient support for _lookups_ — the technique of answering "is this value in this table" cheaply. Lookups are how modern proof systems handle the non-arithmetic operations ML is full of, and ProtoStar's folding cost for them does not grow with the table's size.

The thread through all four: start with "fold a loop of identical steps," then progressively handle branching steps, richer steps, and table-driven steps — each generation making folding fit a wider range of real computation, ML's awkward operations included.

```
   Nova        fold a loop of identical steps
     |
     v
   SuperNova   steps may branch; pay only for the step taken
     |
     v
   HyperNova   richer per-step logic; high-degree, no cross-terms
     |
     v
   ProtoStar   a general recipe; cheap lookups for non-arithmetic ops
```

## Why folding fits zkML specifically

Most computations you might want to prove are an awkward fit for folding, because most computations are not cleanly repetitive. zkML is the rare exception, and the fit is unusually good.

A model's forward pass is a loop with a tiny body and an enormous trip count: the same block, the same handful of operation types, run layer after layer and — in generation — token after token. That is precisely the shape folding's per-step-flat cost was designed to compress. Where a monolithic prover sees one circuit too large to hold, a folding prover sees one modest step circuit and a long, cheap fold over it. The repetition that makes a large model expensive to prove all at once is the same repetition that makes it _cheap to prove a step at a time_. The model's structure and the proof technique's best case are the same structure. That alignment is why every serious roadmap for scaling zkML has folding somewhere in it.

## What folding does and does not solve

Now the discipline, because folding is a real lever and also not a magic one.

What it does: it removes the one-shot memory ceiling, flattens per-step proving cost, and keeps the final proof small. Concrete evidence that this is more than theory — accumulation-based zkML compilers that fold repeated blocks have reported real speedups, on the order of bringing a GPT-2 proof from over an hour down to roughly ten minutes, and a larger model's proof down several-fold. The curve is bending.

What it does not do: make large-model zkML cheap _today_. Ten minutes for GPT-2 is a milestone, not a product, and GPT-2 is two to three orders of magnitude smaller than a frontier model. Folding lowers the per-step cost; it does not abolish it, and a frontier model has an astronomical number of steps. The non-arithmetic operations — softmax, layer norm, modern activations — still get approximated, and each approximation is its own cost and its own correctness burden, folding or not. Folding moves the line. It does not move it to where a frontier transformer is provable on demand.

Which means the practical advice has not changed. Our [opML-or-zkML decision tree](/blog/opml-or-zkml/) is still the one to follow: if your model is small and fixed, zkML ships now; if it is a large transformer doing real reasoning, you cannot prove it in 2026, and the answer is a TEE, or opML when your trust model tolerates a challenge window. Folding is the reason that answer has an expiry date — but the expiry date is years out, plural, and an honest practitioner does not pick one.

## Where it is heading

The direction, though, is not in doubt. Folding sits inside two compounding trends: the lineage itself keeps maturing — each generation handles more of what real ML computation actually does — and folding is increasingly the recursion layer inside general-purpose zkVMs, which is the other route to proving an ML program. Stack folding's flat per-step cost on top of the hardware acceleration that [the DeepProve analysis](/blog/deepprove-first-proven-llm/) traced, and the ceiling on "how large a model can be proven at tolerable cost" rises every year.

It rises predictably and it rises slowly — this is grind, not breakthrough. But the monotonic part is the part to internalize: a model that is out of reach this year will be in reach some year, and folding is a large share of why. The right posture is the one the zkML posts keep landing on — build for what ships today, and treat folding-plus-hardware as a migration trigger you monitor, not a slot you leave empty hoping.

## Reading list

- The [Nova paper](https://eprint.iacr.org/2021/370) (Kothapalli, Setty, Tzialla) — the origin of folding schemes and IVC; the introduction is readable even if you skip the proofs.
- The [microsoft/Nova](https://github.com/microsoft/Nova) implementation — the reference codebase, and the fastest way to see folding as something concrete rather than abstract.
- Our [five-zkML-libraries benchmark](/blog/which-zkml-ships/) — where zkML actually stands today, and the small-model ceiling folding is built to lift.
- The [first-proven-LLM analysis](/blog/deepprove-first-proven-llm/) — DeepProve, GPT-2, and why "impossible to expensive" is the move that folding and hardware then erode.

A monolithic proof asks you to re-derive the whole meal at the end. Folding keeps a running checksum and proves each step as you cook. zkML's computation is a loop — so the second approach is the one that scales, eventually.