Decentralized training in 2026: what works, what's still vapor.
A grounded look at distributed pretraining across untrusted GPUs. DiLoCo, DisTrO, INTELLECT-2, Bittensor's Templar, 0G's DiLoCoX — what each actually shipped, and what hasn't.

Three years ago, “decentralized training” was a slide in a token whitepaper. In 2026 it’s a half-dozen production networks, a handful of real papers, and a 32-billion-parameter model that was trained over the public internet across three continents.
Decentralized-training work covers a lot of surface area — subnet design, validator policy, sandbagging detection, gradient verification — and the honest picture is more interesting than either the hype or the skepticism. Some things work. Some things don’t. The space is still moving fast. This post is the map you want before you start.
Training large models on a single cluster is a solved engineering problem. The hardware is in one room, the network is fast, the operator is trusted. You synchronize gradients across hundreds of GPUs over an NVLink + InfiniBand fabric and you call it a day. Bandwidth between GPUs ranges from tens of GB/s across nodes to hundreds of GB/s within a node. Latency is microseconds.
Training large models across the internet, on hardware operated by untrusted parties, is a different problem entirely. Bandwidth between any two GPUs is megabytes per second. Latency is hundreds of milliseconds. The party providing the GPU might be lying about the work it did. The network might drop a participant mid-step. None of the classical distributed-training algorithms — all-reduce SGD with synchronous updates — survive this environment. They require bandwidth and trust that don’t exist.
Two questions define the field:
Everything else — incentives, tokenomics, subnet structures — flows from the answers.
The 2023 DeepMind paper DiLoCo (Distributed Low-Communication training) is the foundation. The trick is straightforward: instead of synchronizing gradients every step, each worker runs many steps locally and only synchronizes every H steps, with H typically between 100 and 1000.
Each worker holds a full copy of the model. Each worker runs H local optimizer steps on its own data shard. After H steps, all workers exchange their local model deltas (not gradients — deltas, after H steps of updates). The deltas are averaged. The averaged delta is applied to a global model state. Workers pull the new global state. Repeat.
The mathematical sleight of hand: instead of treating the workers as parallel-SGD nodes, treat them as participants in a meta-optimization where the outer loop is across worker-deltas and the inner loop is the local steps. The outer loop runs once every H steps. Communication is H× less frequent than vanilla synchronous SGD.
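The inner/outer structure described above can be sketched in plain Python. This is a toy under stated assumptions, not the DiLoCo implementation: the "model" is a three-weight linear regressor, the inner loop uses plain full-batch gradient steps, and the outer step simply adds the averaged delta scaled by a hypothetical `outer_lr`. All function names and hyperparameters here are illustrative.

```python
import random

def grad(w, shard):
    """Mean-squared-error gradient of a linear model on one shard."""
    g = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += 2.0 * err * xj / len(shard)
    return g

def local_steps(w, shard, H, lr):
    """Inner loop: H gradient steps on a private copy of the weights."""
    w = list(w)
    for _ in range(H):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, shard))]
    return w

def outer_round(global_w, shards, H=50, inner_lr=0.05, outer_lr=1.0):
    """Outer loop: collect each worker's delta, average, apply once."""
    deltas = []
    for shard in shards:
        local_w = local_steps(global_w, shard, H, inner_lr)
        # The delta is the only thing a worker sends per round.
        deltas.append([lw - gw for lw, gw in zip(local_w, global_w)])
    avg = [sum(col) / len(deltas) for col in zip(*deltas)]
    return [gw + outer_lr * a for gw, a in zip(global_w, avg)]

def make_shards(true_w, n_workers=4, n=64, seed=0):
    """Synthetic, noise-free regression data: one private shard per worker."""
    rng = random.Random(seed)
    shards = []
    for _ in range(n_workers):
        shard = []
        for _ in range(n):
            x = [rng.gauss(0, 1) for _ in true_w]
            shard.append((x, sum(t * xi for t, xi in zip(true_w, x))))
        shards.append(shard)
    return shards

TRUE_W = [1.5, -2.0, 0.5]
shards = make_shards(TRUE_W)
w = [0.0, 0.0, 0.0]
for _ in range(10):          # 10 synchronizations for 500 local steps per worker
    w = outer_round(w, shards)
max_error = max(abs(a - b) for a, b in zip(w, TRUE_W))
```

The point of the sketch is the shape of the loop: workers talk once per `H` local steps, and what they exchange is a weight delta rather than a per-step gradient.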
DiLoCo showed this matches synchronous-SGD performance on language-model pretraining at H = 500. Communication volume drops by ~500x. The cost is twofold. Each synchronization round takes longer in wall-clock terms, because workers do more local work before syncing. And the run tolerates only limited shard staleness: the data each worker sees during its local steps cannot drift too far from the global distribution.
OpenDiLoCo, released by Prime Intellect in July 2024, reproduced DiLoCo’s results on a publicly trained model, with workers spread across two continents. Bandwidth between workers averaged ~125 Mbps. They demonstrated that you can match a single-data-center training run across the internet if you accept H = 500.
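A back-of-envelope calculation shows why H matters at this kind of bandwidth. The sketch below assumes a hypothetical 1-billion-parameter model, 2 bytes per parameter, no compression, and one full delta transfer per synchronization; none of those figures come from the OpenDiLoCo release.

```python
def sync_seconds(n_params, bytes_per_param, bandwidth_mbps):
    """Seconds to ship one full set of model deltas over a single link."""
    bits = n_params * bytes_per_param * 8
    return bits / (bandwidth_mbps * 1e6)

def amortized_seconds_per_step(n_params, bytes_per_param, bandwidth_mbps, H):
    """Communication time charged to each local step when syncing every H steps."""
    return sync_seconds(n_params, bytes_per_param, bandwidth_mbps) / H

# Assumed: 1B parameters, 2 bytes each, a 125 Mbps link.
per_sync = sync_seconds(1_000_000_000, 2, 125)
every_step = amortized_seconds_per_step(1_000_000_000, 2, 125, H=1)
every_500 = amortized_seconds_per_step(1_000_000_000, 2, 125, H=500)
```

Under these assumptions a single transfer takes about 128 seconds. Paying that on every step is hopeless; paying it once per 500 steps amortizes to roughly a quarter of a second per step.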
This was the moment the field went from “interesting research” to “shippable infrastructure.”
Nous Research published DisTrO in August 2024 and started running live training over consumer-grade internet connections. Their claim, a ~10,000x reduction in communication versus all-reduce SGD, is a multiplicative stack:
The full DisTrO recipe is more complex than DiLoCo and trickier to get right (the sparse masking interacts badly with certain optimizer states), but it does what it claims. Nous has been live-streaming a 15B-parameter pretraining run for months at distro.nousresearch.com, training over commodity bandwidth.
What DisTrO does not do: scale to frontier-model parameter counts. The 15B run is the biggest publicly demonstrated DisTrO training. Scaling beyond ~30B parameters runs into the same memory-per-worker constraints that single-cluster training has, because each worker still holds a full model copy. Pipeline-parallelism and tensor-parallelism on top of DiLoCo are research-grade as of this writing.
INTELLECT-2, released May 2025 by Prime Intellect, is the largest decentralized RL-trained model to date. 32 billion parameters. Trained on 285,000 math and coding tasks, with NuminaMath-1.5 among the sources. The training was globally distributed, asynchronous, and on hardware contributed by participants without a trusted central party.
INTELLECT-2 introduced two pieces of infrastructure I want to highlight:
TOPLOC (Trusted OPtimistic LOCality-sensitive hashing). When a worker produces an RL rollout (a model trajectory through a task), the worker also publishes a locality-sensitive hash of the rollout. The hash is cheap to compute and cheap to verify. Because it is locality-sensitive, an honest re-run lands close to the published value, while a rollout that has been tampered with lands far from it. A validator can re-run a small fraction of rollouts, recompute their LSH, and check against the worker’s claim. If too many disagree, the worker is suspect.
TOPLOC is not a zero-knowledge proof of correctness. It is a probabilistic spot-check. A malicious worker can sometimes fake a rollout that happens to hash close to a legitimate one — but the rate at which they can do this is low enough that paying them is economically a bad bet. This is the same intuition that makes optimistic rollups work.
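The spot-check idea can be illustrated with a generic locality-sensitive hash. The sketch below uses random-hyperplane SimHash over a vector of numbers standing in for a rollout's activations. It is an assumption-laden toy, not TOPLOC's actual construction: the vector size, bit count, shared seed, and `max_distance` threshold are all hypothetical.

```python
import random

def simhash(vector, n_bits=64, seed=1234):
    """Random-hyperplane LSH: one sign bit per hyperplane."""
    rng = random.Random(seed)  # shared seed, so worker and validator use the same planes
    bits = []
    for _ in range(n_bits):
        plane = [rng.gauss(0, 1) for _ in vector]
        bits.append(sum(p * v for p, v in zip(plane, vector)) >= 0)
    return bits

def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def spot_check(claimed_hash, recomputed_vector, max_distance=8):
    """Validator side: accept if the re-run lands near the worker's claimed hash."""
    return hamming(claimed_hash, simhash(recomputed_vector)) <= max_distance

rng = random.Random(0)
honest = [rng.gauss(0, 1) for _ in range(256)]
rerun = [v + rng.gauss(0, 1e-3) for v in honest]  # benign numerical noise on re-execution
forged = [rng.gauss(0, 1) for _ in range(256)]    # output unrelated to the real computation

claim = simhash(honest)
honest_ok = spot_check(claim, rerun)
forged_ok = spot_check(claim, forged)
```

The tolerance is the design choice that matters: a cryptographic hash would reject the honest re-run over floating-point noise, while a threshold that is too loose lets near-misses through.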
SHARDCAST. A tree-based weight distribution protocol. When a new global model state is computed, it’s not broadcast naively (1-to-N, where the broadcaster’s bandwidth becomes a bottleneck). It’s structured into a tree where each node forwards to a small fanout, and the network self-balances based on observed bandwidths. This is the same idea as BitTorrent’s piece-selection algorithm adapted for live model-weight delivery. It scales to thousands of workers without a centralized weight server.
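To make the fanout argument concrete, here is a minimal sketch of a bandwidth-ordered distribution tree. It is not the SHARDCAST protocol: the static breadth-first layout, the `fanout=4` default, and the synthetic bandwidth figures are assumptions chosen only to show how a tree caps each sender's upload count.

```python
from collections import Counter

ORIGIN = -1  # pseudo-index for the node that holds the new checkpoint

def build_tree(bandwidths, fanout=4):
    """Map each worker to a parent, placing faster workers nearer the root."""
    order = sorted(range(len(bandwidths)), key=lambda i: -bandwidths[i])
    parents = {}
    for pos, worker in enumerate(order):
        if pos < fanout:
            parents[worker] = ORIGIN  # first layer pulls from the origin
        else:
            parents[worker] = order[(pos - fanout) // fanout]
    return parents

def hops_from_origin(parents, worker):
    """Number of transfers before the weights reach this worker."""
    hops = 1
    while parents[worker] != ORIGIN:
        worker = parents[worker]
        hops += 1
    return hops

# Synthetic observed bandwidths for 1,000 workers (arbitrary units).
bandwidths = [(i * 37) % 101 + 1 for i in range(1000)]
parents = build_tree(bandwidths)
max_hops = max(hops_from_origin(parents, w) for w in parents)
max_uploads = max(Counter(parents.values()).values())  # origin included
```

With 1,000 workers and a fanout of 4, no sender uploads more than 4 copies and every worker is at most 5 hops from the origin, versus 1,000 uploads from a single broadcaster in the naive scheme.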
Both of these are now part of PRIME-RL, Prime Intellect’s open-source RL framework. They are real, working, in-production code.
A note on what came after. Prime Intellect’s follow-up reasoning model, INTELLECT-3 — a 106-billion-parameter mixture-of-experts, post-trained from GLM-4.5-Air with large-scale RL — was not trained this way. It ran on a centralized 512-GPU H200 cluster. The team most invested in decentralized training picked a single cluster for its flagship RL run. That is the honest state of the field: decentralized methods are real for pretraining at moderate scale, but frontier-grade RL post-training still happens in one room.
Bittensor operates differently. It is a network of “subnets” — each subnet a market for a specific kind of AI work, with its own validator policy and reward function. Subnet operators set the rules; miners (workers) compete to produce the best outputs; validators score them; the TAO emissions distribute proportionally to scored quality.
The Templar subnet (Subnet 3, formerly under different ownership) was repurposed in 2024 to coordinate decentralized pretraining. In March 2026 it announced Covenant-72B, a 72-billion-parameter model trained on roughly 1.1 trillion tokens across the subnet’s miners. According to the public reporting, this is the largest decentralized pretraining run to date, eclipsing INTELLECT-2’s 32B.
I want to flag two things about Covenant-72B specifically.
First, the run was structured as a subnet competition. Miners weren’t running a single coordinated training job; they were competing to produce gradient updates that the subnet validators considered high-quality. The “model” that emerges is a curated aggregation of miner contributions, not the output of a single SGD trajectory. This is closer to a federated-learning competition than to classical training.
Second, the verification is score-based, not proof-based. Validators score miner contributions by running held-out evaluations and ranking. There is no cryptographic check that any individual miner’s submitted gradient is correct. The subnet’s incentive design assumes miners will compete to produce useful gradients because that’s the only way to earn TAO; sandbagging (submitting plausible-looking junk to harvest emissions) is detected statistically over many rounds.
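Score-based reward allocation of this kind reduces to a few lines. The sketch below is a hypothetical illustration, not Templar's validator code: it assumes each miner's score is the held-out loss improvement attributed to its contribution, clips negative scores to zero, and splits a fixed emission pool proportionally.

```python
def emission_shares(scores, total_emission):
    """Split an emission pool in proportion to non-negative miner scores."""
    clipped = {miner: max(score, 0.0) for miner, score in scores.items()}
    total = sum(clipped.values())
    if total == 0:
        return {miner: 0.0 for miner in scores}  # nobody helped, nobody is paid
    return {miner: total_emission * s / total for miner, s in clipped.items()}

# Assumed scores: held-out loss before minus loss after applying the miner's update.
scores = {"miner_a": 0.03, "miner_b": 0.01, "miner_c": -0.02}
shares = emission_shares(scores, total_emission=100.0)
```

Nothing in this allocation checks how a contribution was produced. It only measures whether the contribution scored well, which is exactly the property the next paragraph worries about.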
I have concerns about this design at scale. If a sandbagger discovers a way to produce plausible-looking junk that the validators score as legitimate, they earn emissions until detected. Detection in subnets typically takes weeks. The economics of sandbagging therefore depend on (a) how hard it is to produce junk that scores, and (b) how fast the validator’s policy adapts. These are open empirical questions. The Templar subnet seems to have it under control today; whether the same design scales to a 700B model is unclear.
0G Labs has described DiLoCoX-107B, a 107-billion-parameter model they report training decentrally in 2025. The communication claim is striking: a 357x speedup, which the paper states relative to vanilla all-reduce over a 1 Gbps network rather than relative to DiLoCo itself.
The DiLoCoX paper (arXiv 2506.21263) describes a more aggressive sparsity + quantization stack on top of DiLoCo, plus topology-aware scheduling. The 357x is the multiplicative product of all those tricks; the marginal improvement over DisTrO’s ~10,000x stack is unclear from public materials.
Even with the paper public, DiLoCoX-107B is hard to fully externally validate. The constituent techniques are real and documented, but the headline run’s training data, evaluation suite, and gradient-verification protocol are not all reproducible from public materials. The claim is plausible; independent reproduction is not yet possible. The Bittensor Covenant-72B announcement has the same shape: the model is real, the network is live, but the verification that the work was done as claimed is a question of trust in the operators rather than cryptographic certainty.
This is the gap between “decentralized training” as a marketing term and as an engineering reality. Most current systems are distributed (work is spread across many parties), but the trust model still requires you to trust the operators of those parties — the subnet’s validator set, the network’s emissions policy, the team auditing the training run. The cryptographic primitives that would remove the last trust assumption (zero-knowledge proofs of gradient computation, verifiable forward/backward passes) exist as research, but no large-scale training run has used them as the primary verification layer. They’re still 100,000x too expensive.
Setting aside the academic ideal of cryptographically proven training, here is what the production networks do today.
Statistical sandbagging detection. The validator policy runs each worker’s contributions through a battery of cheap checks: distribution of gradient norms, agreement with neighbor workers on the same data shard, presence/absence of expected gradient signatures from known-good runs. Workers whose statistics drift outside expected ranges get down-weighted.
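One of the cheap checks named above, the distribution of gradient norms, can be sketched with a robust outlier score. This is an illustrative assumption rather than any network's actual policy: it uses a median/MAD z-score, a conventional cutoff of 3.5, and a hypothetical down-weighting rule.

```python
import statistics

def downweight_outliers(grad_norms, threshold=3.5):
    """Return a weight in (0, 1] per worker from a robust z-score of its gradient norm."""
    values = list(grad_norms.values())
    med = statistics.median(values)
    # Median absolute deviation; the fallback avoids division by zero.
    mad = statistics.median(abs(v - med) for v in values) or 1e-12
    weights = {}
    for worker, norm in grad_norms.items():
        z = 0.6745 * abs(norm - med) / mad
        weights[worker] = 1.0 if z <= threshold else threshold / z
    return weights

norms = {"a": 1.0, "b": 1.1, "c": 0.9, "d": 1.05, "e": 9.0}
weights = downweight_outliers(norms)
```

Median-based statistics are used here because a single worker submitting an enormous norm would drag a mean and standard deviation toward itself and partially hide its own anomaly.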
Spot-check re-execution. A small fraction (~1%) of worker rollouts are re-executed by a designated validator. Mismatches trigger an investigation. The fraction is tunable: a higher rate costs more compute but catches cheaters sooner.
Locality-sensitive hashing (TOPLOC). As described above. The probability that a malicious worker can fake a rollout that passes LSH is small enough that the expected value of cheating is negative across many rounds.
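The expected-value argument behind spot checks is simple arithmetic. The sketch below assumes a deliberately crude model: each round is checked independently with probability `check_rate`, a checked cheat is always caught, an uncaught cheat earns `gain`, and a caught one forfeits `penalty`. The numbers are hypothetical.

```python
def detection_probability(check_rate, rounds):
    """Chance a persistent cheater is caught at least once in `rounds` rounds."""
    return 1.0 - (1.0 - check_rate) ** rounds

def cheating_ev_per_round(check_rate, gain, penalty):
    """Expected payoff of cheating in a single round."""
    return (1.0 - check_rate) * gain - check_rate * penalty

def break_even_penalty(check_rate, gain):
    """Penalty at which cheating has zero expected value."""
    return gain * (1.0 - check_rate) / check_rate

caught_in_100 = detection_probability(0.01, 100)
needed_penalty = break_even_penalty(0.01, gain=1.0)
ev_with_big_stake = cheating_ev_per_round(0.01, gain=1.0, penalty=200.0)
```

At a 1% check rate, a cheater is caught within 100 rounds about 63% of the time, and the penalty has to exceed roughly 99 rounds of earnings before cheating becomes a losing bet. A low check rate therefore only works when paired with a large stake or a large reputational loss.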
Reputation accumulation. Workers build up a track record. New workers are paid less per unit of work until they accumulate trust. The cost of building a new worker identity exceeds the short-term cheating reward.
Held-out eval gates. New model checkpoints are gated by performance on a held-out evaluation suite that workers cannot directly optimize against. Sandbagging that produces good training-loss-looking gradients but bad held-out performance is filtered at the gate.
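A held-out gate is easy to express in code. The sketch below is a hypothetical minimal version: it accepts a checkpoint only if its held-out loss does not regress past an assumed tolerance, and it tracks the best accepted loss as the new baseline. Real gates would use a suite of evaluations rather than a single number.

```python
def passes_gate(candidate_loss, baseline_loss, tolerance=0.01):
    """Accept a checkpoint only if held-out loss has not regressed beyond tolerance."""
    return candidate_loss <= baseline_loss + tolerance

def advance(baseline_loss, candidates):
    """Filter a sequence of (name, held_out_loss) checkpoints through the gate."""
    accepted = []
    for name, loss in candidates:
        if passes_gate(loss, baseline_loss):
            accepted.append(name)
            baseline_loss = min(baseline_loss, loss)  # the baseline only improves
    return accepted, baseline_loss

accepted, best = advance(
    2.50,
    [("ckpt-1", 2.45), ("ckpt-2", 2.60), ("ckpt-3", 2.455)],
)
```

Here `ckpt-2` is rejected for regressing, while `ckpt-3` passes because it sits within tolerance of the best loss seen so far.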
None of these alone is bulletproof. Together they make sustained cheating expensive. The economics work — in a well-run Bittensor subnet, the dominant strategy after one month of operation is “actually do the work” because the alternative paths to TAO emissions are statistically detectable.
Working today:
Live but unproven at scale:
Vapor or marketing:
For teams considering decentralized training, the right questions are:
What model size are you targeting? Below 30B, DiLoCo-style on Prime Intellect, Nous, or your own coordinator is genuinely cheaper than renting a single A100 cluster. Above 70B, the engineering complexity ratchets up sharply; consider whether the cost arbitrage is worth it.
What’s your data governance? Decentralized training is genuinely interesting when your training data can’t leave individual workers’ premises — federated learning is the use case where the architecture adds value beyond cost. If you can ship your data to a single cluster, do that.
Do you trust your worker operators? Bittensor subnets are great if you’re comfortable with the subnet’s validator policy. Prime Intellect’s coordinator is great if you’re comfortable with a single operator. Pure peer-to-peer is great if you’re comfortable doing the verification engineering yourself.
What’s your verification budget? Statistical spot-checking is cheap (~1% of compute). Full re-execution is 100% of compute. ZK proofs currently cost on the order of 100,000x the compute. Pick a point on this curve based on how adversarial your workers are.
Is there a credible exit path? If the decentralized network you’re depending on shuts down or has a governance crisis, can you migrate to a centralized provider? Most engagements should retain that optionality.
If you have $2M and you want to pretrain something useful in a decentralized way, the playbook is:
Two years from now this will all look different. The communication-efficient SGD work will continue. ZK-training will come down in cost. Some of the subnets will fold; some will become important infrastructure. The principle is settled: decentralized training is real, it works at 10-100B scale, and the people doing it now are accumulating moats. The unknown is which networks survive.
The right way to evaluate a network claim is to read the code. Most of these networks are open source; their gradient verification, their validator policies, their emissions curves are all on GitHub. Don’t take press releases at face value. Read the code.
