Five zkML libraries, benchmarked. Only one ships today.
EZKL, Modulus, Giza, Ora, RISC Zero. Same model, same input, same target chain. Proof times, gas costs, gotchas — and the one we'd put in front of a customer.


Take a tier-1 lending protocol that wants its credit risk model proven on-chain so governance can verify every parameter update. Before anyone writes a line of circuit code, you benchmark: five libraries, the same model, the same input vector, the same target chain. We ran exactly that benchmark in our own lab (two engineers, four weeks). The deliverable is a recommendation memo with proof times, gas costs, and a verdict.
This essay is most of that memo, framed generically. If you’re considering shipping zkML for verifiable inference, this is the homework. If you’ve already chosen a library, this might tell you what you’re missing.
The reference model: a 14-layer MLP, ~3.4M parameters, taking a 96-dimensional input (loan features) and producing a single scalar (risk score, post-sigmoid). PyTorch-trained, INT8-quantized for inference. Real enough — this is the shape of model that DeFi credit protocols run in production today, just without on-chain verification.
Why this model and not a transformer? Because zkML’s circuit cost scales with operations, and a 3.4M-parameter MLP is already at the edge of what’s routinely practical. Transformer attention layers are an order of magnitude harder to prove than dense layers. The frontier did move in 2025 — Lagrange’s DeepProve produced the first full zero-knowledge proof of an LLM inference, GPT-2 — but GPT-2-scale (~100M–1.5B params) is the ceiling, and proving is slow and expensive. If your zkML use case requires a frontier-scale transformer doing real reasoning, the verdict is still: you can’t, today. Use a TEE.
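The workload: one proof per inference, generated within a 30-second budget and verified on Base for under $1 per proof.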
That constraint is a customer requirement, not an engineering wish. Roughly 80 inferences per day; at $1 per proof, that’s $80/day or $30k/year, which a real protocol can swallow.
A quick primer before the numbers. All five take the same conceptual approach — they compile a machine-learning model into an arithmetic circuit and produce a proof of correct evaluation — but the flavor of the proof and the constraints on what the model can do differ substantially.
EZKL takes an ONNX model and compiles it to a Halo2 circuit. Halo2 uses KZG commitments under the hood (no trusted setup per-circuit; a universal setup is reused). Proofs are SNARKs with constant-size verification on EVM.
What EZKL handles natively: dense, conv, pool, activation, batchnorm, residual connections. Quantization is configurable; INT8 and INT16 are well-supported. The build step converts the ONNX graph into a Halo2 circuit (PLONKish arithmetization, not R1CS) and emits both the prover binary and a Solidity verifier contract.
EZKL is the most mature option for neural network-style models. It is also the most actively maintained (commits weekly as of writing). The cost of all this maturity is that the toolchain is opinionated; you can’t easily extend the supported ops without writing Halo2 chips yourself.
Modulus Labs is a historical reference. The team joined Tools for Humanity in late 2024 to work on World, and the company no longer operates as an independent zkML vendor. Its open-source Remainder prover — a Halo2 variant tuned for ML — outlived it, and that is what we benchmarked, treating it as a library.
Remainder supports a similar op set to EZKL — MLPs, CNNs, transformers up to a point — and has notably faster proof generation for small-to-medium models because of optimizations in the lookup-argument design. Treat the numbers below as a benchmark of the prover, not an endorsement of a product you can still buy.
Giza transpiles ML models into Cairo programs and proves them with STARKs. Note that Giza’s focus has since shifted toward verifiable DeFi agents on Starknet — its current zkML stack is the LuminAIR framework on StarkWare’s S-two prover — but the transpile-to-Cairo, prove-with-STARK approach is what we benchmarked. The proof is a STARK; verification on EVM happens via a STARK verifier contract.
The STARK approach has trade-offs versus SNARK approaches like Halo2. STARKs need no trusted setup at all (universal or otherwise) and offer faster prover throughput on certain workloads, but they produce larger proofs and cost more to verify on-chain. STARK verifiers on Ethereum L1 are 6-8x more expensive in gas than KZG-based SNARK verifiers. On L2s with cheaper calldata, this gap narrows.
Giza’s other distinguishing feature: a model marketplace plus execution network. You can publish a model to their marketplace; consumers query it by paying for proofs. This is vertical integration on top of the proving primitive, something EZKL and Modulus deliberately don’t offer.
Ora is the “opML” pioneer. Their flagship approach is optimistic ML — submit an inference claim with a bond, and let challengers re-execute and dispute. It is not zero-knowledge in the cryptographic sense; it relies on game-theoretic finality (you wait a challenge window, similar to optimistic rollups).
In 2025 they added zk-OPML, a hybrid: optimistic by default for low-stakes inference, with the option to demand a ZK proof for high-stakes inference. We benchmarked the ZK path specifically because the lending protocol’s customer required cryptographic finality, not optimistic.
Ora’s zk prover is based on their own circuit construction with floating-point-friendly fixed-point arithmetic. The pitch is “support IEEE-754-like precision in proofs,” which matters for risk models that don’t quantize cleanly.
RISC Zero takes a different tack: instead of compiling your model to a circuit, you compile it to a RISC-V program and prove the execution trace of the program. The zkVM is general-purpose; it doesn’t know your code is ML.
R0VM 2.0 (April 2025) brought proving throughput up to roughly 1M cycles/second on Bonsai (their proving cloud). Running the model as a fixed-point, PyTorch-equivalent program on RISC Zero’s rust-ndarray-based ML stack works, but you are at the mercy of a general zkVM’s overhead, which is substantial.
The win of RISC Zero is flexibility. You write Rust, you compile, you prove. You’re not constrained by what the library supports. The loss is performance: proofs are an order of magnitude more expensive than circuit-specific approaches.
We compiled the same MLP through each toolchain, ran 50 proofs on each (same hardware: one L4 GPU on AWS, 192GB RAM, NVMe disk), and tallied the numbers. Proof times include compilation overhead once; subsequent runs reuse the compiled circuit. Gas costs are measured against Base L2.
| Library | Prover time (median) | Prover time (p95) | Proof size | Gas to verify | Setup cost |
|---|---|---|---|---|---|
| EZKL (Halo2) | 8.2 s | 11.4 s | 1.1 KB | 412k gas | One-time (~10 min compile) |
| Modulus Remainder | 5.8 s | 7.9 s | 1.4 KB | 504k gas | One-time (~14 min compile) |
| Giza (STARK) | 22 s | 31 s | 91 KB | 2.4M gas | One-time (~6 min compile) |
| Ora zk-OPML | 18 s | 26 s | 1.8 KB | 588k gas | One-time (~8 min compile) |
| RISC Zero zkVM | 41 s | 58 s | 220 KB (succinct: 1.4 KB) | 540k gas (succinct) | None (no per-circuit compile) |
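The median and p95 columns above are ordinary summary statistics over the 50 timed runs per library. A minimal sketch of how such columns could be produced, assuming each prover is wrapped in a plain Python callable (the `prove` argument is hypothetical, not any library's API):

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(prove: Callable[[], None], runs: int = 50) -> List[float]:
    """Time `runs` calls of a prover callable and return seconds per call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        prove()  # hypothetical wrapper around one proof generation
        timings.append(time.perf_counter() - start)
    return timings


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return the median and 95th percentile of a list of timings."""
    # quantiles(n=20) returns 19 cut points; the last one is the p95.
    p95 = statistics.quantiles(samples, n=20, method="inclusive")[-1]
    return {"median": statistics.median(samples), "p95": p95}
```

Reporting p95 alongside the median matters here because a latency budget is usually broken by the slow tail, not the typical run.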
Translating gas to dollars at Base’s median gas price (~0.04 gwei) and ETH at $4200:
| Library | Per-proof verify cost (USD, Base) |
|---|---|
| EZKL | $0.069 |
| Modulus | $0.085 |
| Giza | $0.40 |
| Ora | $0.099 |
| RISC Zero | $0.091 |
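The conversion behind this table is a single multiplication. A small sketch, using the gas figures from the benchmark table and the stated assumptions of 0.04 gwei and $4200 per ETH (the function and constant names are illustrative):

```python
GWEI_PER_ETH = 1_000_000_000

# Gas to verify, taken from the benchmark table above.
VERIFY_GAS = {
    "EZKL": 412_000,
    "Modulus": 504_000,
    "Giza": 2_400_000,
    "Ora": 588_000,
    "RISC Zero": 540_000,
}


def verify_cost_usd(gas_used: int, gas_price_gwei: float, eth_price_usd: float) -> float:
    """Convert a gas amount into USD at a given gas price and ETH price."""
    eth_spent = gas_used * gas_price_gwei / GWEI_PER_ETH
    return eth_spent * eth_price_usd


if __name__ == "__main__":
    for name, gas in VERIFY_GAS.items():
        print(f"{name}: ${verify_cost_usd(gas, 0.04, 4200):.3f}")
```

Both inputs move constantly, so it is worth re-running the conversion with current gas and ETH prices before quoting a per-proof cost to anyone.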
All five passed our $1 budget. Two exceeded our 30-second proof-time budget: Giza at p95 (31 s against a 22 s median) and RISC Zero outright (41 s median, 58 s p95). Ora’s 26 s p95 cleared it, narrowly. RISC Zero’s “succinct” variant (a Groth16 wrapper around the underlying STARK) brought verification cost into a reasonable range, but proof generation still took 41 s.
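Budget checks like this are easy to get wrong by eye, so it can help to apply them mechanically to the table rows. The sketch below hard-codes the benchmark figures and assumes the cost budget covers on-chain verification only (prover compute is excluded); the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BenchResult:
    name: str
    median_s: float
    p95_s: float
    verify_usd: float


# Rows copied from the two tables above.
RESULTS = [
    BenchResult("EZKL", 8.2, 11.4, 0.069),
    BenchResult("Modulus Remainder", 5.8, 7.9, 0.085),
    BenchResult("Giza", 22.0, 31.0, 0.40),
    BenchResult("Ora zk-OPML", 18.0, 26.0, 0.099),
    BenchResult("RISC Zero", 41.0, 58.0, 0.091),
]


def over_time_budget(results: List[BenchResult], max_seconds: float = 30.0,
                     use_p95: bool = True) -> List[str]:
    """Names of libraries whose proof time exceeds the budget."""
    return [r.name for r in results
            if (r.p95_s if use_p95 else r.median_s) > max_seconds]


def over_cost_budget(results: List[BenchResult], max_usd: float = 1.0) -> List[str]:
    """Names of libraries whose verify cost exceeds the budget."""
    return [r.name for r in results if r.verify_usd > max_usd]
```

Whether the time budget is held against the median or the p95 changes the result, so state which one the requirement means before eliminating anything.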
Numbers aside, did each library actually produce correct proofs? The answer for all five is yes, but with caveats.
EZKL. Correct out of the box. The only fiddle is quantization: you need to dial in INT8 with per-channel scale factors to keep the on-chain output within 0.001 of the floating-point reference. EZKL has good tooling for this; about a day of work.
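To make "per-channel scale factors" concrete, here is a generic sketch of symmetric INT8 quantization with one scale per output channel, plus a tolerance check against the floating-point reference. This is not EZKL's API or its calibration tooling; it is a pure-Python illustration of the technique on a single dense layer, with hypothetical function names.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_channel(weights: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization, one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        # Round to the nearest integer step and clamp to the INT8 range.
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dense(weights: Matrix, x: List[float]) -> List[float]:
    """Floating-point reference: y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def dense_quantized(q_rows: List[List[int]], scales: List[float],
                    x: List[float]) -> List[float]:
    """Integer weights, rescaled per channel after the dot product."""
    return [s * sum(q * xi for q, xi in zip(row, x))
            for row, s in zip(q_rows, scales)]


def within_tolerance(ref: List[float], got: List[float], tol: float = 0.001) -> bool:
    """True if every output is within `tol` of the reference."""
    return max(abs(a - b) for a, b in zip(ref, got)) <= tol
```

Per-channel scales help because a single layer-wide scale is set by the largest weight anywhere in the layer, which wastes precision on channels whose weights are all small.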
Modulus. Correct, but the toolchain assumes the model is a “ChainModule” — their own model format that wraps ONNX. Conversion is straightforward but required reading their source to understand the layer-mapping conventions.
Giza. Correct, but Giza’s Cairo transpiler struggles with the 14-layer model — the transpiler hits a memory limit during compilation and you end up bisecting the model into sub-circuits. About three days of work to get a clean compile. Once compiled, proofs are stable.
Ora. Correct. The zk path is newer than the opML path; expect to hit two or three unfixed compiler bugs in a four-week window. The ones we hit had workarounds (manually rewriting a layer to avoid the offending op), but the experience is rougher than EZKL or Modulus. Worth re-testing every six months as the toolchain matures.
RISC Zero. Correct, with the caveat that the proof asserts “this RISC-V program executed with this input/output,” not “this neural network was evaluated correctly.” If you want the second guarantee, you have to also commit to the program binary on-chain so a verifier knows the RISC-V program is the model. We handled this with a SHA-256 commitment to the program. Adds a small amount of friction; not insurmountable.
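The commitment itself is small. A minimal off-chain sketch of the idea, assuming the compiled guest program is available as raw bytes and the digest is what gets stored on-chain (the function names are hypothetical, and the on-chain side is not shown):

```python
import hashlib
import hmac


def program_commitment(program_bytes: bytes) -> str:
    """SHA-256 digest of the compiled program, as a lowercase hex string."""
    return hashlib.sha256(program_bytes).hexdigest()


def matches_commitment(program_bytes: bytes, committed_hex: str) -> bool:
    """Check a program binary against a previously published digest."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(program_commitment(program_bytes),
                               committed_hex.lower())
```

The digest only binds the proof to one specific binary. Anyone relying on it still needs a reproducible build so they can confirm that binary really is the model.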
A short comparison of what each library can actually prove, beyond our specific MLP.
| Op family | EZKL | Modulus | Giza | Ora | RISC Zero |
|---|---|---|---|---|---|
| Dense + activations | Yes | Yes | Yes | Yes | Yes |
| Conv / pool | Yes | Yes | Partial | Yes | Yes |
| Batchnorm | Yes | Yes | Limited | Yes | Yes |
| Residual / skip | Yes | Yes | Yes | Yes | Yes |
| Attention (transformer) | Limited | Limited (slow) | Limited | Limited | Yes (very slow) |
| Floating-point | No (INT8) | No (INT8) | No (fixed-point) | Yes (IEEE-754-ish) | Yes (slow) |
| Models > 50M params | Slow but possible | Slow but possible | Hard | Possible (slow) | Practical for inference traces only |
For our 3.4M-parameter MLP, all five worked. If we’d been benchmarking a small transformer (say 30M params, BERT-tiny), only RISC Zero would have survived end-to-end in reasonable time, and only at a per-proof cost of dollars rather than cents.
The answer is EZKL.
Three reasons:
The Remainder prover posts the close-second numbers — and on a larger workload, say a 50M-param model, it would probably win on proof time — but with Modulus Labs no longer maintaining it, adopting it means adopting an unsupported codebase. Giza is eliminated for both proof time and gas cost on this workload. Ora is eliminated for tooling immaturity. RISC Zero is eliminated because the proof cost overhead from a general-purpose zkVM isn’t justified for a fixed, well-characterized model.
What this should look like in production, projected from the benchmark:
A short list of things to get right up front.
Plan for quantization at training time, not inference time. Quantizing a model post-hoc to fit a zkML toolchain’s INT8 requirements changes the model’s outputs. Expect 2-4% drift on classification tasks. Some customers tolerate this; not all will. If you’re going to ship zkML, train your model with quantization in the loop.
Don’t try to prove the whole pipeline. It’s tempting to scope “prove the model” to include input feature engineering, which is a non-trivial set of normalization steps. Feature engineering is usually a Python pipeline using libraries with no zkML support. The right design is to commit to the feature engineering on-chain (a hash of the deterministic pipeline) and prove only the model itself. That means you trust the feature pipeline by other means — make that explicit.
Get the verifier audit done early. A Halo2 Solidity verifier is non-trivial code. Treat it as a first-class contract for audit purposes. Auditors who haven’t seen a Halo2 verifier in production before need extra time; budget for it.
Cache the proving key. The Halo2 proving key for a circuit this size is around 220MB. Loading it from disk each proof adds 2-3 seconds. Pre-load and pin in memory; saves a meaningful chunk of latency.
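In a Python-fronted proving service, the simplest form of this is to read the key once per process and reuse the bytes. A sketch under that assumption (the path is whatever your build emits; nothing here is an EZKL call):

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=1)
def load_proving_key(path: str) -> bytes:
    """Read the proving key from disk once; later calls return cached bytes."""
    return Path(path).read_bytes()
```

Calling `load_proving_key` at worker startup moves the disk read out of the request path. Note that this keeps the key in process memory but does not lock pages against swapping; true pinning needs OS-level support such as `mlock`.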
Use a worker pool. A single L4 produces 1 proof every ~8 seconds. If your inference rate exceeds that — even briefly — you need either a worker pool with queueing or larger GPUs. Size for 5× peak load.
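The sizing rule reduces to one line of arithmetic. A sketch, assuming one GPU per worker and the 5× headroom suggested above (the function name and the per-minute unit are illustrative choices):

```python
import math


def workers_needed(peak_proofs_per_minute: float, seconds_per_proof: float,
                   headroom: float = 5.0) -> int:
    """GPUs required to absorb peak load with a safety multiplier."""
    per_worker_per_minute = 60.0 / seconds_per_proof
    required = peak_proofs_per_minute * headroom / per_worker_per_minute
    # Always keep at least one worker, even at zero load.
    return max(1, math.ceil(required))
```

For example, a burst of 3 proofs per minute at 8.2 seconds per proof works out to 3 workers once the 5× headroom is applied, even though a single L4 could keep up with the raw rate.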
The biggest single bottleneck in zkML right now is circuit size for non-trivial models. A 100M-parameter model requires roughly 100B constraints to prove. At current proving throughput (~10M constraints/sec on top GPUs), that’s 10,000 seconds — almost three hours per proof. That’s not viable for any real-time use case.
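The estimate above can be written as a small calculator, which makes it easy to test what a given speedup buys. It assumes the essay's own rough figures: about 1,000 constraints per parameter (100B constraints for 100M parameters) and 10M constraints per second. Both are order-of-magnitude assumptions, not measurements.

```python
CONSTRAINTS_PER_PARAM = 1_000        # implied by 100B constraints / 100M params
CONSTRAINTS_PER_SECOND = 10_000_000  # rough top-GPU throughput from the text


def proof_seconds(params: float, speedup: float = 1.0) -> float:
    """Estimated proving time in seconds for a model with `params` parameters."""
    constraints = params * CONSTRAINTS_PER_PARAM
    return constraints / (CONSTRAINTS_PER_SECOND * speedup)


if __name__ == "__main__":
    baseline = proof_seconds(100e6)
    print(f"100M params today: {baseline:,.0f} s ({baseline / 3600:.2f} h)")
    print(f"with a 10x speedup: {proof_seconds(100e6, speedup=10):,.0f} s")
```

Under this linear model, a 10x speedup still leaves a 100M-parameter proof at roughly a thousand seconds, so reaching single-digit seconds needs several improvements to compound.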
Three trends could fix this.
Lookup arguments and folding schemes (Plonkup, Halo2’s “lookup tables,” Nova/SuperNova for folding multiple proofs). These let you express common ops more efficiently. They are already in EZKL and Modulus. Each generation has been 2-5x faster.
GPU and FPGA proving. EZKL today uses CUDA but most kernels are not FFT-optimized. The plausible 2026-2027 roadmap is a 10x speedup over current numbers from GPU/FPGA optimization alone, before any algorithmic improvement.
General LLM proving. The development that most reframes this picture is Lagrange’s DeepProve, which in 2025 produced the first full zero-knowledge proof of an LLM inference (GPT-2). It is a milestone, not a product: GPT-2 is small, and the proof is slow and large. But it moves “prove a transformer” from impossible to expensive — and expensive is the kind of problem that folding schemes and hardware erode.
If these trends play out, by 2028 the practical limit shifts from “3.4M-param MLP in 8 seconds” to “100M-param transformer in 8 seconds.” That changes which use cases are viable. Risk models are already in range; on-chain agents doing real reasoning are not, yet.
The conclusion I’d commit to: if you have a fixed, well-characterized ML model whose outputs need cryptographic auditability, zkML is shippable today. If you have a transformer doing arbitrary reasoning, it is not shippable today. Use a TEE.
When the second case becomes shippable, the on-chain agent ecosystem will look very different.
