FIELD NOTES · TRAINING · 2026.04.10 · 13 min

Decentralized training in 2026: what works, what's still vapor.

A grounded look at distributed pretraining across untrusted GPUs. DiLoCo, DisTrO, INTELLECT-2, Bittensor's Templar, 0G's DiLoCoX — what each actually shipped, and what hasn't.

Three years ago, “decentralized training” was a slide in a token whitepaper. In 2026 it’s a half-dozen production networks, a handful of real papers, and a 32-billion-parameter model that was trained over the public internet across three continents.

Decentralized-training work covers a lot of surface area — subnet design, validator policy, sandbagging detection, gradient verification — and the honest picture is more interesting than either the hype or the skepticism. Some things work. Some things don’t. The space is still moving fast. This post is the map you want before you start.

The actual problem

Training large models on a single cluster is a solved engineering problem. The hardware is in one room, the network is fast, the operator is trusted. You synchronize gradients across hundreds of GPUs over an NVLink + InfiniBand fabric and you call it a day. Bandwidth between any two GPUs is hundreds of GB/s. Latency is microseconds.

Training large models across the internet, on hardware operated by untrusted parties, is a different problem entirely. Bandwidth between any two GPUs is megabytes per second. Latency is hundreds of milliseconds. The party providing the GPU might be lying about the work it did. The network might drop a participant mid-step. None of the classical distributed-training algorithms — all-reduce SGD with synchronous updates — survive this environment. They require bandwidth and trust that don’t exist.

Two questions define the field:

  1. Communication efficiency. How do you train across the internet without synchronizing gigabytes per step?
  2. Verification. How do you know the worker did the work?

Everything else — incentives, tokenomics, subnet structures — flows from the answers.

Communication efficiency: DiLoCo and its descendants

The 2023 DeepMind paper DiLoCo (Distributed Low-Communication training) is the foundation. The trick is straightforward: instead of synchronizing gradients every step, each worker runs many steps locally and only synchronizes every H steps, with H typically between 100 and 1000.

Each worker holds a full copy of the model. Each worker runs H local optimizer steps on its own data shard. After H steps, all workers exchange their local model deltas (not gradients — deltas, after H steps of updates). The deltas are averaged. The averaged delta is applied to a global model state. Workers pull the new global state. Repeat.

The mathematical sleight of hand: instead of treating the workers as parallel-SGD nodes, treat them as participants in a meta-optimization where the outer loop is across worker-deltas and the inner loop is the local steps. The outer loop runs once every H steps. Communication is H× less frequent than vanilla synchronous SGD.
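
The inner/outer structure is easier to see in code. Below is a minimal sketch on a toy quadratic loss, not the DiLoCo implementation: the inner optimizer here is plain SGD and the outer step is a plain average with a hypothetical `outer_lr`, whereas the paper uses stronger optimizers in both loops. All function names are illustrative.

```python
from typing import List


def local_steps(w: List[float], target: List[float], h: int, lr: float) -> List[float]:
    """Inner loop: h plain-SGD steps on the toy loss 0.5 * ||w - target||^2."""
    w = list(w)
    for _ in range(h):
        grad = [wi - ti for wi, ti in zip(w, target)]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def diloco_round(global_w: List[float], worker_targets: List[List[float]],
                 h: int, inner_lr: float, outer_lr: float = 1.0) -> List[float]:
    """One outer step: every worker trains locally, then deltas are averaged."""
    deltas = []
    for target in worker_targets:  # each target stands in for one worker's data shard
        local_w = local_steps(global_w, target, h, inner_lr)
        # Workers send the model delta after h steps, not per-step gradients.
        deltas.append([lw - gw for lw, gw in zip(local_w, global_w)])
    n = len(deltas)
    avg_delta = [sum(d[i] for d in deltas) / n for i in range(len(global_w))]
    return [gw + outer_lr * ad for gw, ad in zip(global_w, avg_delta)]


def train(worker_targets: List[List[float]], rounds: int, h: int, inner_lr: float) -> List[float]:
    """Run `rounds` outer steps from a zero-initialized global model."""
    w = [0.0] * len(worker_targets[0])
    for _ in range(rounds):
        w = diloco_round(w, worker_targets, h, inner_lr)
    return w
```

Calling `train([[1.0, 2.0], [3.0, 4.0]], rounds=5, h=50, inner_lr=0.1)` takes 250 local steps per worker but only 5 exchanges, and the global model settles near the mean of the two workers' optima.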

DiLoCo showed this matches synchronous-SGD performance on language-model pretraining at H = 500. Because workers exchange one model-sized delta every 500 steps instead of one model-sized gradient every step, communication volume drops by ~500x. The cost is longer wall-clock time between synchronized updates (workers do H local steps before every sync) and a limit on how heterogeneous the shards can be (the data each worker sees during its local steps cannot drift too far from the global distribution).

OpenDiLoCo, released by Prime Intellect in July 2024, reproduced DiLoCo’s results on a publicly trained model, with workers spread across two continents. Bandwidth between workers averaged ~125 Mbps. They demonstrated that you can match a single-data-center training run across the internet if you accept H = 500.
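
To see why H matters at that bandwidth, here is a back-of-envelope sketch. It assumes a dense, uncompressed payload over a single link, no overlap between communication and compute, and a hypothetical 1-second local step; real systems exchange with several peers and overlap transfers, so treat the numbers as orders of magnitude only.

```python
def sync_seconds(n_params: float, bytes_per_param: float, link_mbps: float) -> float:
    """Seconds to push one dense copy of the weights (or a dense delta) over one link."""
    payload_bits = n_params * bytes_per_param * 8
    return payload_bits / (link_mbps * 1e6)


def comm_fraction(sync_s: float, step_s: float, h: int) -> float:
    """Share of wall-clock time spent communicating when syncing once every h steps."""
    return sync_s / (sync_s + h * step_s)


# A 1B-parameter model in 16-bit precision over a 125 Mbps link.
one_sync = sync_seconds(1e9, 2, 125)            # 128 seconds per exchange
every_step = comm_fraction(one_sync, 1.0, 1)    # syncing every step: ~99% of time on the wire
every_500 = comm_fraction(one_sync, 1.0, 500)   # syncing every 500 steps: ~20%
```

Under these assumptions, syncing every step leaves the GPUs idle almost all the time, while H = 500 brings communication down to roughly a fifth of wall-clock time before any compression is applied.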

This was the moment the field went from “interesting research” to “shippable infrastructure.”

Pushing further: DisTrO and the 10,000x claim

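Nous Research published DisTrO in August 2024 and started running live training over a webcam-grade internet connection. Their claim, a ~10,000x reduction in communication versus all-reduce SGD, is a multiplicative stack. The factors below are rough estimates; multiplied naively they come to ~50,000x, so they should not be read as independent: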

  • DiLoCo’s H-step batching: ~500x.
  • DCT-based delta compression (transforming the delta into a fixed frequency basis and quantizing the coefficients): ~10x.
  • Sparse-top-k masking (only transmit the largest 5% of delta entries by magnitude): ~5x.
  • Per-participant adaptive scheduling: another ~2x in steady-state.
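
The top-k masking layer in the list above is the simplest one to sketch. The code below is a generic magnitude-based sparsifier, not DisTrO's implementation; the function names and the byte sizes in `compression_ratio` are assumptions for illustration. It also shows one plausible reason a 5% mask yields less than 20x on the wire: the sender has to transmit indices as well as values.

```python
from typing import List, Tuple


def top_k_sparsify(delta: List[float], keep_fraction: float) -> List[Tuple[int, float]]:
    """Keep the largest-magnitude entries of delta as (index, value) pairs."""
    k = max(1, round(len(delta) * keep_fraction))
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    return [(i, delta[i]) for i in sorted(ranked[:k])]


def densify(sparse: List[Tuple[int, float]], length: int) -> List[float]:
    """Receiver side: rebuild a dense delta, treating dropped entries as zero."""
    out = [0.0] * length
    for i, v in sparse:
        out[i] = v
    return out


def compression_ratio(n: int, keep_fraction: float,
                      value_bytes: int, index_bytes: int) -> float:
    """Dense payload size divided by sparse payload size (values plus indices)."""
    k = max(1, round(n * keep_fraction))
    return (n * value_bytes) / (k * (value_bytes + index_bytes))
```

With 2-byte values and 4-byte indices (both hypothetical), `compression_ratio(1000, 0.05, 2, 4)` comes out near 6.7x rather than 20x. A production system would also accumulate the dropped entries locally and resend them later; that error-feedback step is omitted here.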

The full DisTrO recipe is more complex than DiLoCo and trickier to get right (the sparse masking interacts badly with certain optimizer states), but it does what it claims. Nous has been live-streaming a 15B-parameter pretraining run for months at distro.nousresearch.com, training over commodity bandwidth.

What DisTrO does not do: scale to frontier-model parameter counts. The 15B run is the biggest publicly demonstrated DisTrO training. Scaling beyond ~30B parameters runs into the same memory-per-worker constraints that single-cluster training has, because each worker still holds a full model copy. Pipeline-parallelism and tensor-parallelism on top of DiLoCo are research-grade as of this writing.

INTELLECT-2: 32B distributed RL

INTELLECT-2, released May 2025 by Prime Intellect, is the largest decentralized RL-trained model to date. 32 billion parameters. Trained on 285,000 math and coding tasks drawn from NuminaMath-1.5, Deepscaler, and SYNTHETIC-1. The training was globally distributed, asynchronous, and ran on hardware contributed by participants without a central trust party.

INTELLECT-2 introduced two pieces of infrastructure I want to highlight:

TOPLOC (Trusted OPtimistic LOCality-sensitive hashing). When a worker produces an RL rollout (a model trajectory through a task), the worker also publishes a locality-sensitive hash of the rollout. The hash is cheap to compute and cheap to verify. Because it is locality-sensitive, it stays close under benign numerical noise (different GPUs and kernels produce slightly different floats) but lands far away if the rollout was not produced the way the worker claims. A validator can re-run a small fraction of rollouts, recompute their LSH, and check against the worker’s claim. If too many disagree, the worker is suspect.

TOPLOC is not a zero-knowledge proof of correctness. It is a probabilistic spot-check. A malicious worker can sometimes fake a rollout that happens to hash close to a legitimate one — but the rate at which they can do this is low enough that paying them is economically a bad bet. This is the same intuition that makes optimistic rollups work.
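
The "close under noise, far under tampering" property can be illustrated with a toy fingerprint. This is not TOPLOC's actual construction: the scheme below simply commits to the largest-magnitude entries of an activation vector, and `k`, `min_overlap`, and `tol` are hypothetical parameters chosen for illustration.

```python
from typing import Dict, List


def fingerprint(activations: List[float], k: int = 16) -> Dict[int, float]:
    """Worker side: commit to the k largest-magnitude entries as {index: value}."""
    top = sorted(range(len(activations)),
                 key=lambda i: abs(activations[i]), reverse=True)[:k]
    return {i: activations[i] for i in top}


def matches(claimed: Dict[int, float], recomputed: Dict[int, float],
            min_overlap: float = 0.9, tol: float = 1e-2) -> bool:
    """Validator side: accept if most indices agree and shared values are within tol."""
    shared = set(claimed) & set(recomputed)
    if len(shared) < min_overlap * len(claimed):
        return False
    return all(abs(claimed[i] - recomputed[i]) <= tol for i in shared)
```

A validator that re-runs the rollout on different hardware gets slightly different floats, which this check tolerates. Activations from a different model or prompt share almost none of the top indices, so the check fails. The tolerance is where the economics live: a looser `tol` forgives more hardware variation and also gives a cheater more room.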

SHARDCAST. A tree-based weight distribution protocol. When a new global model state is computed, it’s not broadcast naively (1-to-N, where the broadcaster’s bandwidth becomes a bottleneck). It’s structured into a tree where each node forwards to a small fanout, and the network self-balances based on observed bandwidths. This is the same idea as BitTorrent’s piece-selection algorithm adapted for live model-weight delivery. It scales to thousands of workers without a centralized weight server.
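
A rough sketch of why a tree helps, assuming a static breadth-first layout with a fixed fanout. The bandwidth-aware self-balancing described above is not modeled, and the function names are illustrative rather than taken from SHARDCAST.

```python
from collections import deque
from typing import Dict, List


def build_tree(nodes: List[str], fanout: int) -> Dict[str, List[str]]:
    """Give each node up to `fanout` children in breadth-first order; nodes[0] is the origin."""
    children: Dict[str, List[str]] = {n: [] for n in nodes}
    open_parents = deque([nodes[0]])
    for node in nodes[1:]:
        parent = open_parents[0]
        children[parent].append(node)
        open_parents.append(node)
        if len(children[parent]) == fanout:
            open_parents.popleft()  # this parent is full; move on to the next one
    return children


def depth(children: Dict[str, List[str]], root: str) -> int:
    """Number of forwarding hops needed to reach the farthest node."""
    level, frontier = 0, [root]
    while True:
        nxt = [c for n in frontier for c in children[n]]
        if not nxt:
            return level
        level, frontier = level + 1, nxt
```

With 1,000 workers and a fanout of 4, the origin uploads 4 copies of the weights instead of 1,000, and every worker is reached within 5 hops. The trade is latency for origin bandwidth: each hop adds one full transfer time unless the weights are split into shards and pipelined down the tree.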

Both of these are now part of PRIME-RL, Prime Intellect’s open-source RL framework. They are real, working, in-production code.

A note on what came after. Prime Intellect’s follow-up reasoning model, INTELLECT-3 — a 106-billion-parameter mixture-of-experts, post-trained from GLM-4.5-Air with large-scale RL — was not trained this way. It ran on a centralized 512-GPU H200 cluster. The team most invested in decentralized training picked a single cluster for its flagship RL run. That is the honest state of the field: decentralized methods are real for pretraining at moderate scale, but frontier-grade RL post-training still happens in one room.

Bittensor’s Templar subnet: Covenant-72B

Bittensor operates differently. It is a network of “subnets” — each subnet a market for a specific kind of AI work, with its own validator policy and reward function. Subnet operators set the rules; miners (workers) compete to produce the best outputs; validators score them; the TAO emissions distribute proportionally to scored quality.

The Templar subnet (Subnet 3, formerly under different ownership) was repurposed in 2024 to coordinate decentralized pretraining. In March 2026 it announced Covenant-72B, a 72-billion-parameter model trained on roughly 1.1 trillion tokens across the subnet’s miners. According to the public reporting, this is the largest decentralized pretraining run to date, and larger in parameter count than INTELLECT-2’s 32B RL run.

I want to flag two things about Covenant-72B specifically.

First, the run was structured as a subnet competition. Miners weren’t running a single coordinated training job; they were competing to produce gradient updates that the subnet validators considered high-quality. The “model” that emerges is a curated aggregation of miner contributions, not the output of a single SGD trajectory. This is closer to a federated-learning competition than to classical training.

Second, the verification is score-based, not proof-based. Validators score miner contributions by running held-out evaluations and ranking. There is no cryptographic check that any individual miner’s submitted gradient is correct. The subnet’s incentive design assumes miners will compete to produce useful gradients because that’s the only way to earn TAO; sandbagging (submitting plausible-looking junk to harvest emissions) is detected statistically over many rounds.

I have concerns about this design at scale. If a sandbagger discovers a way to produce plausible-looking junk that the validators score as legitimate, they earn emissions until detected. Detection in subnets typically takes weeks. The economics of sandbagging therefore depend on (a) how hard it is to produce junk that scores, and (b) how fast the validator’s policy adapts. These are open empirical questions. The Templar subnet seems to have it under control today; whether the same design scales to a 700B model is unclear.

0G’s DiLoCoX-107B claim

0G Labs has described DiLoCoX-107B, a 107-billion-parameter model they report training decentrally in 2025. The headline claim is striking: a 357x training speedup over vanilla AllReduce on a 1 Gbps network.

The DiLoCoX paper (arXiv 2506.21263) describes a more aggressive sparsity + quantization stack on top of DiLoCo, plus topology-aware scheduling. The 357x is the combined effect of all those tricks. Because it is a throughput speedup against AllReduce rather than a communication-volume ratio, how it compares with DisTrO’s ~10,000x stack is unclear from public materials.

Even with the paper public, DiLoCoX-107B is hard to fully validate from the outside. The constituent techniques are real and documented, but the headline run’s training data, evaluation suite, and gradient-verification protocol are not all reproducible from public materials. The claim is plausible; independent reproduction is not yet possible. The Bittensor Covenant-72B announcement has the same shape: the model is real, the network is live, but the verification that the work was done as claimed is a question of trust in the operators rather than cryptographic certainty.

This is the gap between “decentralized training” as a marketing term and as an engineering reality. Most current systems are distributed (work is spread across many parties), but the trust model still requires you to trust the operators of those parties — the subnet’s validator set, the network’s emissions policy, the team auditing the training run. The cryptographic primitives that would remove the last trust assumption (zero-knowledge proofs of gradient computation, verifiable forward/backward passes) exist as research, but no large-scale training run has used them as the primary verification layer. They’re still 100,000x too expensive.

Verification in practice: how networks actually check workers

Setting aside the academic ideal of cryptographically proven training, here is what the production networks do today.

Statistical sandbagging detection. The validator policy runs each worker’s contributions through a battery of cheap checks: distribution of gradient norms, agreement with neighbor workers on the same data shard, presence/absence of expected gradient signatures from known-good runs. Workers whose statistics drift outside expected ranges get down-weighted.
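
One of the cheap checks, outlier detection on gradient norms, might look like the sketch below. It uses the standard median/MAD modified z-score with the conventional 3.5 cutoff; nothing here is taken from a specific network's validator code, and the worker ids are made up.

```python
from statistics import median
from typing import Dict, List


def flag_outliers(norms: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return worker ids whose gradient norm is a robust outlier among its peers."""
    values = list(norms.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers cannot inflate.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [w for w, v in norms.items() if v != med]
    # 0.6745 rescales MAD so the score is comparable to a normal z-score.
    return [w for w, v in norms.items() if 0.6745 * abs(v - med) / mad > threshold]


round_norms = {"a": 1.0, "b": 1.1, "c": 0.9, "d": 1.05, "e": 0.95, "f": 0.01}
suspects = flag_outliers(round_norms)  # only "f" stands out
```

Median and MAD are used instead of mean and standard deviation so that a few colluding workers cannot shift the baseline they are judged against. A flagged worker is a candidate for down-weighting or a spot-check, not proof of cheating.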

Spot-check re-execution. A small fraction (~1%) of worker rollouts are re-executed by a designated validator. Mismatches trigger an investigation. The fraction is tunable: a higher fraction costs more compute but catches cheaters sooner.
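
The trade-off between check fraction and time-to-detection is simple arithmetic. The sketch assumes each rollout is sampled independently and that a re-executed bad rollout is always recognized as bad; both are simplifications.

```python
import math


def p_caught(check_fraction: float, bad_rollouts: int) -> float:
    """Probability that at least one of `bad_rollouts` fakes gets re-executed."""
    return 1.0 - (1.0 - check_fraction) ** bad_rollouts


def rollouts_until(check_fraction: float, confidence: float) -> int:
    """Smallest number of fakes after which detection probability reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - check_fraction))
```

At a 1% check rate, a worker that fakes 100 rollouts is caught with probability of about 0.63, and it takes 299 fakes to reach 95%. Raising the rate to 5% cuts that to 59 fakes, at five times the validator compute.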

Locality-sensitive hashing (TOPLOC). As described above. The probability that a malicious worker can fake a rollout that passes LSH is small enough that the expected value of cheating is negative across many rounds.
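
"Negative expected value" can be made concrete with a two-outcome model. The `stake` below is a hypothetical stand-in for whatever a caught worker loses (a slashed deposit, forfeited emissions, or the cost of rebuilding reputation); none of the production networks is claimed to use exactly this formula.

```python
def cheating_ev(reward: float, stake: float, p_detect: float) -> float:
    """Expected payoff of one faked round: keep the reward if undetected, lose the stake if caught."""
    return (1.0 - p_detect) * reward - p_detect * stake


def break_even_detection(reward: float, stake: float) -> float:
    """Detection probability above which a faked round loses money in expectation."""
    return reward / (reward + stake)
```

With a reward of 1 and a stake of 200, a 1% detection rate is already enough: `cheating_ev(1.0, 200.0, 0.01)` is negative, and the break-even rate is about 0.5%. With a stake of 50 the same 1% check leaves cheating profitable, which is why low check rates only work when there is something substantial to lose.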

Reputation accumulation. Workers build up a track record. New workers are paid less per unit of work until they accumulate trust. The cost of building a new worker identity exceeds the short-term cheating reward.

Held-out eval gates. New model checkpoints are gated by performance on a held-out evaluation suite that workers cannot directly optimize against. Sandbagging that produces good training-loss-looking gradients but bad held-out performance is filtered at the gate.

None of these alone is bulletproof. Together they make sustained cheating expensive. The economics work — in a well-run Bittensor subnet, the dominant strategy after one month of operation is “actually do the work” because the alternative paths to TAO emissions are statistically detectable.

What’s actually shipping vs. what’s vapor

Working today:

  • DiLoCo-style training across the public internet at ≤30B parameters. (OpenDiLoCo, DisTrO, Prime Intellect, Nous.)
  • Distributed RL at 32B. (INTELLECT-2.)
  • Federated pretraining via subnet incentives. (Bittensor Templar.)
  • GPU marketplaces with real, on-chain accounting. (Akash, io.net, Render.)
  • TOPLOC and SHARDCAST as rollout verification + weight distribution primitives.

Live but unproven at scale:

  • Decentralized training at 70B+ parameters. (0G claims DiLoCoX-107B, Bittensor Templar claims Covenant-72B, neither has fully public verification.)
  • Tensor/pipeline-parallel DiLoCo. (Research stage; not in production networks.)
  • ZK-proven training. (RISC Zero and others can do this in principle; the overhead at GPT-class scale is still 100,000x+.)

Vapor or marketing:

  • “We trained a frontier model on the blockchain.” (Translation: we ran a training job and used a chain for accounting. The training was on regular GPUs in a regular data center.)
  • “Trustless training.” (No production system is cryptographically trustless. They are trust-minimized via incentives and spot-checks.)
  • Token-incentivized training without a verification layer. (If the rewards are visible but the work isn’t checked, you’re paying sandbaggers.)

When to actually use this

For teams considering decentralized training, the right questions are:

  1. What model size are you targeting? Below 30B, DiLoCo-style training on Prime Intellect, Nous, or your own coordinator can be cheaper than renting a single A100 cluster. Above 70B, the engineering complexity ratchets up sharply; consider whether the cost arbitrage is worth it.

  2. What’s your data governance? Decentralized training is genuinely interesting when your training data can’t leave individual workers’ premises — federated learning is the use case where the architecture adds value beyond cost. If you can ship your data to a single cluster, do that.

  3. Do you trust your worker operators? Bittensor subnets are great if you’re comfortable with the subnet’s validator policy. Prime Intellect’s coordinator is great if you’re comfortable with a single operator. Pure peer-to-peer is great if you’re comfortable doing the verification engineering yourself.

  4. What’s your verification budget? Statistical spot-checking is cheap (~1% extra compute). Full re-execution is 100% extra. ZK proofs are on the order of 100,000x the compute (roughly 10,000,000%). Pick a point on this curve based on how adversarial your workers are.

  5. Is there a credible exit path? If the decentralized network you’re depending on shuts down or has a governance crisis, can you migrate to a centralized provider? Most engagements should retain that optionality.
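
For question 4, the verification curve turns into a budget line quickly. The sketch below treats each option as a fixed overhead on top of training compute and takes the ZK overhead as the 100,000x figure quoted elsewhere in this post; the dictionary keys and the flat-overhead model are illustrative assumptions.

```python
# Extra compute per unit of training compute, by verification strategy.
OVERHEAD = {
    "spot_check": 0.01,         # re-execute ~1% of the work
    "full_reexecution": 1.0,    # redo everything once
    "zk_proof": 100_000.0,      # order-of-magnitude figure, not a measurement
}


def total_cost(training_cost: float, strategy: str) -> float:
    """Training cost plus verification overhead for the chosen strategy."""
    return training_cost * (1.0 + OVERHEAD[strategy])
```

On a $2M training budget, spot-checking adds about $20K, full re-execution doubles the bill, and ZK proofs land in the hundreds of billions, which is why no production network uses them as the primary layer.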

What I’d build right now

If you have $2M and you want to pretrain something useful in a decentralized way, the playbook is:

  • Target a 10-30B parameter model. Above that, the engineering surface area explodes.
  • Use Prime Intellect’s PRIME-RL framework or a fork. It’s the most mature open-source decentralized-training codebase as of this writing.
  • Use TOPLOC for rollout verification. The implementation is open; the math is solid.
  • Run on Prime Intellect’s worker network or a custom Bittensor subnet. The choice depends on whether you want their validator policy or your own.
  • Keep a centralized escape hatch: a contract with Crusoe or Lambda for a 256-GPU cluster you can spin up in 48 hours if the decentralized run hits a wall.
  • Plan for the training run to be slower than the equivalent centralized run by 1.5-3x in wall clock. The cost advantage is in dollars-per-step, not steps-per-day.

Two years from now this will all look different. The communication-efficient SGD work will continue. ZK-training will come down in cost. Some of the subnets will fold; some will become important infrastructure. The principle is settled: decentralized training is real, it works at 10-30B scale today with 70-100B runs claimed but not yet independently verified, and the people doing it now are accumulating moats. The unknown is which networks survive.

Reading list

The right way to evaluate a network claim is to read the code. Most of these networks are open source; their gradient verification, their validator policies, their emissions curves are all on GitHub. Don’t take press releases at face value. Read the code.
