# The first proven LLM: what DeepProve changes for zkML.

For years the honest answer to "can you prove an LLM ran correctly" was no. Not "expensive," not "slow" — _no_. You could prove a small MLP. You could prove a CNN if you were patient. But a transformer doing real autoregressive generation, end to end, attention layers and all, was outside the set of things a zero-knowledge proof system could express at any price.

In August 2025 that changed. Lagrange announced DeepProve-1, billed as the first zero-knowledge proof of a full LLM inference. The model was GPT-2. Not a frontier model, not even close — but a real transformer, proven from token-in to logits-out, with no part of the forward pass waved away.

Milestones get over-read in both directions. One camp says verifiable AI has arrived. The other says GPT-2 is a toy and nothing changed. Both are wrong. What actually happened is narrower and more useful than either: a problem moved from the _impossible_ column to the _expensive_ column. This essay is about what that move buys you, and what it does not.

## What DeepProve actually did

Start with the technical claim, stated precisely. DeepProve is Lagrange's zkML proving framework. DeepProve-1 produced a zero-knowledge proof that a specific GPT-2 forward pass was computed correctly: given a committed set of weights and a public input, the output logits are the genuine result of running that network, and the proof reveals nothing else.
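The shape of that claim can be sketched as a small data structure. This is an illustration only, not DeepProve's interface: `InferenceStatement`, `commit`, and `matches_registered_model` are hypothetical names, and a plain SHA-256 digest stands in for whatever commitment scheme a real prover uses.

```python
import hashlib
from dataclasses import dataclass


def commit(data: bytes) -> str:
    """Stand-in commitment: a SHA-256 digest (an assumption, not a real prover's scheme)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class InferenceStatement:
    """The public part of the claim, i.e. what a verifier gets to see."""
    weights_commitment: str  # binds the claim to one specific set of weights
    input_tokens: tuple      # the public input
    output_logits: tuple     # the claimed result of the forward pass


def matches_registered_model(stmt: InferenceStatement, registered: str) -> bool:
    """Binding check only: is this claim about the model the verifier expects?

    Checking the proof itself is out of scope for this sketch.
    """
    return stmt.weights_commitment == registered


weights = b"\x00\x01\x02\x03"  # hypothetical serialized weights
registered = commit(weights)
stmt = InferenceStatement(registered, (15496, 995), (0.1, -2.3))
```

The weights themselves never appear in the statement, only their commitment. That is what lets a proof vouch for a specific model without disclosing it.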

The word doing the work is _transformer_. Our [zkML library benchmark](/blog/which-zkml-ships/) walked through five proving stacks on a 14-layer MLP, and that was not laziness — dense layers are the easy case. A zkML circuit's cost scales with the arithmetic operations it has to encode, and a transformer's operations are hostile to that encoding in a way dense layers are not.

Consider what an attention layer asks a proof system to swallow:

- **Large matrix multiplications** for the query, key, value, and output projections — every one a wall of multiply-accumulate constraints.
- **A softmax over the attention scores.** Softmax means exponentials and a division, neither a native arithmetic-circuit operation; they get approximated with lookup tables or piecewise polynomials, and every approximation is more constraints and a correctness argument of its own.
- **Layer normalization**, which needs a mean, a variance, and a reciprocal square root — another non-native function, another lookup-and-prove.
- **The autoregressive loop.** A generated sequence runs the whole stack once per token. Proving a 50-token completion is, roughly, proving fifty forward passes that chain together consistently.
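To make the softmax bullet concrete, here is a minimal sketch of an integer-only softmax built on a lookup table. The scale, the clamp, and the table layout are our own assumptions for illustration. They are not how DeepProve or any particular stack encodes softmax.

```python
import math

SCALE = 256        # fixed-point scale: a real value v is stored as round(v * SCALE)
CLAMP = 8 * SCALE  # differences larger than 8.0 are treated as 8.0

# Built once, offline: EXP_TABLE[k] approximates exp(-k / SCALE) in fixed point.
EXP_TABLE = [round(math.exp(-k / SCALE) * SCALE) for k in range(CLAMP + 1)]


def softmax_fixed(scores):
    """Integer-only softmax over fixed-point scores, using table lookups."""
    top = max(scores)
    # Subtract the max so every exponent is <= 0, then clamp into the table.
    exps = [EXP_TABLE[min(top - s, CLAMP)] for s in scores]
    total = sum(exps)  # never zero: the max score always maps to EXP_TABLE[0]
    # One integer division per output; results are fixed-point probabilities.
    return [(e * SCALE) // total for e in exps]


def softmax_float(values):
    """Floating-point reference, for comparison only."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


fixed = softmax_fixed([256, 512, 768])  # the real scores 1.0, 2.0, 3.0
reference = softmax_float([1.0, 2.0, 3.0])
```

In a circuit, each table read would become a lookup constraint, and each division would need its own argument that the quotient and remainder are correct. That is the extra cost the bullet describes, paid once per attention score.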

Proving an MLP means proving dense layers and activations. Proving a transformer means proving all of that _plus_ softmax, plus layer norm, plus residual connections threaded through every block, plus the loop — and proving they compose. DeepProve-1 is the existence proof that the whole stack can be expressed end to end. Before it, "prove a transformer" was a research aspiration. After it, an engineering line item with a cost next to it.

That is the entire significance of the result, and it is enough. You do not get a second "first." The category — provable transformer inference — is open.

## The fine print

Now the part the launch posts are quieter about — and the part an engineer has to internalize before building anything.

**GPT-2 is small.** The GPT-2 family runs from roughly 124M to 1.5B parameters. A current frontier model is two to three orders of magnitude larger. zkML cost does not scale gently with parameter count: more parameters mean more constraints, which mean more proving work, and attention cost grows with sequence length on top of that. The gap between "we proved GPT-2" and "we proved a frontier model" is not one you close by renting a bigger machine for an afternoon.
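For a sense of the arithmetic, the sketch below counts parameters for the smallest and largest published GPT-2 configurations (12 layers at width 768, and 48 layers at width 1600). The `macs_per_token` helper is our own rough illustration, not a proving-cost model: it counts multiply-accumulates only and ignores softmax, layer norm, and any overhead a proof system adds on top.

```python
def gpt2_param_count(n_layer, d_model, vocab=50257, n_ctx=1024):
    """Parameter count for a GPT-2 style decoder with tied input/output embeddings."""
    embeddings = vocab * d_model + n_ctx * d_model
    # Attention: fused Q/K/V projection plus output projection, with biases.
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4 * d_model, then project back, with biases.
    mlp = (4 * d_model * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # two layer norms per block, scale and shift each
    final_norm = 2 * d_model
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm


def macs_per_token(n_layer, d_model, position, vocab=50257):
    """Rough multiply-accumulate count to produce one token at `position`."""
    weight_macs = n_layer * 12 * d_model * d_model     # projections and MLP
    attention_macs = n_layer * 2 * position * d_model  # scores plus weighted sum over earlier tokens
    logits_macs = vocab * d_model                      # final projection to the vocabulary
    return weight_macs + attention_macs + logits_macs


small = gpt2_param_count(n_layer=12, d_model=768)  # 124,439,808
xl = gpt2_param_count(n_layer=48, d_model=1600)    # 1,557,611,200
```

The `attention_macs` term is the part that grows with sequence length. Every later token in a completion costs more than the one before it, independent of parameter count.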

**Proving is slow.** We are deliberately not putting a number on it — the milestone is real and the precise figure is not ours to invent. The honest characterization: generating a full-LLM proof today is a heavyweight offline computation, the minutes-to-hours class, not the milliseconds an inference takes. It is a batch job, not an interactive one.

**Proofs are large** relative to the tidy ~1 KB SNARKs the MLP benchmark produced for on-chain verification. Bigger proofs cost more to move, store, and verify — how much that matters depends on where the proof has to land.

So: a milestone, not a product — the first flight at Kitty Hawk, not a booked ticket. DeepProve-1 proves transformer inference is provable. It does not mean you can wrap your production LLM in a proof this quarter. Both statements are true at once, and an engineer has to hold both.

```
   IMPOSSIBLE          EXPENSIVE                PRACTICAL
        |                   |                        |
        |   DeepProve-1     |   folding schemes      |
        |   ──────────────> |   + GPU/FPGA proving   |
        |   (Aug 2025)      |   ──────────────────>  |
        |                   |   (years of work)      |
   "prove a            "prove GPT-2,           "prove a useful
    transformer"        slowly, once"           transformer,
                                                routinely"
```

The first arrow has been traversed. That is history now. The second arrow is the open engineering problem, and it is a long one.

## The trajectory

"Expensive" is a better problem than "impossible" for a specific reason: expensive problems respond to engineering, and two engineering vectors are already pointed straight at this one.

**Folding schemes.** A repeated computation — and an autoregressive transformer is the definitional repeated computation, the same block run over and over — is the ideal target for a folding scheme. The Nova line of work, and SuperNova and related constructions after it, lets a prover fold many instances of the same step into a single accumulated instance, so proving cost grows far more slowly than "one full proof per step." Our zkML benchmark already flagged folding as a live trend bending the cost curve for ordinary models. For a transformer's per-token loop the structural fit is even better: the loop is exactly the shape folding was designed to compress.
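The core move can be shown on a toy. The sketch below folds fifty instances of a purely linear relation, `A · w = x` over a prime field, into one accumulated instance using random linear combinations. It is a deliberately simplified stand-in: Nova folds quadratic (R1CS) constraints, works over commitments to the witness rather than the witness itself, and derives its challenges from a transcript instead of a random number generator. All names here are ours.

```python
import random

P = 2**61 - 1  # a Mersenne prime, used as a toy field modulus


def mat_vec(A, w):
    """Matrix-vector product over the field."""
    return [sum(a * b for a, b in zip(row, w)) % P for row in A]


def fold(acc, new, r):
    """Fold `new` into `acc` as acc + r * new, for witness and statement alike."""
    (w_acc, x_acc), (w_new, x_new) = acc, new
    w = [(a + r * b) % P for a, b in zip(w_acc, w_new)]
    x = [(a + r * b) % P for a, b in zip(x_acc, x_new)]
    return w, x


def satisfied(A, instance):
    """Does the instance satisfy A . w = x?"""
    w, x = instance
    return mat_vec(A, w) == x


rng = random.Random(0)  # fixed seed so the toy run is repeatable
A = [[rng.randrange(P) for _ in range(4)] for _ in range(3)]

# Fifty honest steps, echoing a 50-token completion.
instances = []
for _ in range(50):
    w = [rng.randrange(P) for _ in range(4)]
    instances.append((w, mat_vec(A, w)))

acc = instances[0]
for inst in instances[1:]:
    acc = fold(acc, inst, rng.randrange(1, P))

all_steps_ok = satisfied(A, acc)  # one check stands in for fifty
```

Because the relation is linear, the folded instance satisfies it whenever every input does, and a single bad step leaves an error that the nonzero challenge cannot cancel. Handling the cross term that appears once constraints are quadratic is the part the real constructions add.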

**Hardware-accelerated proving.** The proving bottleneck is concrete arithmetic — multi-scalar multiplications, number-theoretic transforms — and that arithmetic parallelizes. GPU proving is real today, most stacks are not yet running fully optimized kernels, and FPGA and eventually ASIC proving sit further out on the same road. The benchmark's estimate for ordinary models was a plausible order-of-magnitude speedup from hardware alone, before any algorithmic cleverness. None of that physics changes because the model is a transformer.

Stack the two. Folding compresses the loop; hardware accelerates each folded step. Neither requires a conceptual breakthrough — both are roadmaps, the grind kind of progress, not the eureka kind. Grind is predictable, which is good news. It is also why nobody serious will hand you a date: "frontier-scale transformers become routinely provable" is years out, plural, and an honest practitioner declines to pick one.

What you _can_ say with confidence: the direction is monotonic. Each new generation of folding schemes and proving hardware makes a larger model provable at tolerable cost than the previous generation could. The ceiling is rising. It is simply not rising fast enough to change what you deploy this year.

## Build now vs. wait

Which turns the milestone into an actual decision. If you are choosing a verifiable-inference architecture in 2026, DeepProve-1 should change your roadmap and not your build. The split:

**If your model is small and fixed, zkML is shippable today.** This was the verdict of the [zkML benchmark](/blog/which-zkml-ships/) and DeepProve-1 does not soften it — a few-million-parameter MLP, INT8-quantized, gets a sub-1 KB proof and cheap on-chain verification with mature tooling. A risk model, a scoring model, a recommendation model with a stable architecture: prove it now with EZKL or a comparable Halo2 stack, and move on.

**If your model is a large transformer doing real reasoning, you cannot prove it today — and the answer is still a TEE.** This is the load-bearing recommendation. When a workload genuinely needs a frontier-scale transformer, reach for a trusted execution environment now: attested hardware gives an integrity guarantee at single-digit-percent overhead, in production, today. opML — an inference claim backed by a bond and a challenge window — is the other shippable option when your trust model tolerates game-theoretic finality instead of a cryptographic proof.

The trap is reading "first proven LLM" as permission to wait. _zkML is coming, so hold off on TEE and adopt zkML when it is ready._ That is a multi-year bet against a roadmap with no committed date — swapping a deployable architecture for a research timeline and calling it forward-looking.

The discipline:

| Your model                        | Verify it today with                  | Migrate to zkML when                      |
| --------------------------------- | ------------------------------------- | ----------------------------------------- |
| Small, fixed (MLP, small CNN)     | zkML now — it already ships           | Already there                             |
| Mid-size transformer (BERT-class) | TEE, or opML if finality model allows | Folding + hardware close the cost gap     |
| Frontier-scale LLM                | TEE — full stop                       | Years out; do not architect around it yet |

Build for what is deployable. Track DeepProve and the folding-and-hardware curve as a _migration trigger_ — not as a reason to leave a slot in your architecture empty and hope.

## The implication for on-chain agents

The part of this that is genuinely exciting sits one layer up, in agents. Our field guide to [on-chain agents](/blog/onchain-agents-vs-agents-that-touch-chains/) drew a ladder of chain integration and put the meaningful tiers out of reach for LLM-driven agents for one reason: you cannot verify the _reasoning_. You can see an agent's transactions on-chain. You cannot see _why_ it made them. The decision step — the LLM call that turned a market state into an action — is an unverified black box an on-chain contract has to take on faith.

zkML closes that gap. A proof of LLM inference lets an off-chain prover assert "I ran this exact model on this exact input and got this output," and lets an on-chain contract _verify_ that assertion and act on it. The reasoning step stops being a black box and becomes a checkable claim — precisely the capability the agent ladder names as the boundary between an agent that touches chains and one that is genuinely, verifiably on-chain.

DeepProve-1 is the first brick in that path. Today it proves GPT-2, slowly — which puts a verifiable LLM agent reasoning at GPT-2 scale, on a non-interactive timescale, at the edge of conceivable rather than over the horizon. A trading agent that proves the model behind every position. A treasury agent whose policy is on-chain _and_ whose every decision carries a proof the policy produced it. None of that ships at frontier scale this year. The point is that it stopped being a category error and became a cost curve — and cost curves move.

When provable transformer inference reaches useful scale, the hard question stops being "can we trust the operator not to lie about the model" and becomes "is the proof cheap enough for this use case" — a budgeting question, not a trust question. DeepProve-1 is the first evidence we are heading toward that world.

## The recommendation

DeepProve-1 is real and it matters. It also does not change what you build this quarter. Hold both. A decision tree for a 2026 verifiable-inference architecture:

1. **Is your model a small, fixed network — an MLP, a small CNN?** Yes → zkML ships today. Use a mature Halo2 stack and quantize at training time. Done.
2. **Is it a transformer doing real reasoning?** Yes → continue.
3. **Do you need an integrity guarantee in production right now?** Yes → TEE. zkML at this scale is not deployable today and will not be for years.
4. **Can your trust model tolerate game-theoretic finality — a bond and a challenge window?** Yes → opML is an option alongside TEE.
5. **Tempted to delay deployment and wait for zkML to mature?** Don't. Ship the TEE; treat the folding-and-hardware curve as a trigger you monitor, not a slot you leave empty.
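
The steps above can be condensed into a few lines of code. This is a sketch of the essay's recommendation and nothing more: the `Choice` enum and `choose_architecture` are hypothetical names, and the two boolean inputs are our simplification of the questions in the list.

```python
from enum import Enum


class Choice(Enum):
    ZKML = "zkML today"
    TEE = "TEE"
    TEE_OR_OPML = "TEE, or opML"


def choose_architecture(small_fixed_model: bool,
                        tolerates_game_theoretic_finality: bool = False) -> Choice:
    """Pick a verifiable-inference architecture that is deployable today."""
    if small_fixed_model:
        # Step 1: small MLPs and CNNs are already provable.
        return Choice.ZKML
    if tolerates_game_theoretic_finality:
        # Step 4: a bond and a challenge window are acceptable.
        return Choice.TEE_OR_OPML
    # Steps 3 and 5: ship the TEE rather than wait.
    return Choice.TEE
```

No branch returns "wait for zkML". Migration is something you trigger later from a deployed system, not an outcome of the decision.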

The thesis, restated: DeepProve-1 moved "prove a transformer" from impossible to expensive, and expensive is the kind of problem folding schemes and proving hardware erode, year over year, predictably. The right posture is neither hype nor dismissal — it is patience with a calendar reminder.

## Reading list

- [Lagrange](https://www.lagrange.dev/) — the team behind DeepProve; their site is the primary source for the framework and its roadmap.
- [EZKL](https://ezkl.xyz/) — the mature ONNX-to-Halo2 zkML proving library, and the right starting point for small, fixed models you can prove today.
- Our own [five-zkML-libraries benchmark](/blog/which-zkml-ships/) — proof times, gas costs, and the small-model ceiling that DeepProve is the first crack in.

Prove the MLP now. Reach for a TEE on the transformer. Watch the ceiling rise — and migrate the day the proof gets cheap enough, not a year before.