FIELD NOTES · TRAINING · 2026.05.09 · 12 min

RL environments are the new dataset.

Post-training has shifted from supervised fine-tuning on static labeled data toward reinforcement learning, and that moves the unit of data work from a labeled file to an executable environment. Building good environments is the new data engineering — and the scarce input.

For most of the last decade, the question “what data do you have” had a tidy answer: a file. Rows, columns, labels. You could open it, count it, sample it, and hand it to a fine-tuning run. The unit of data work was a static, labeled dataset, and a generation of tooling — annotation platforms, dataset registries, quality dashboards — grew up around producing and inspecting that file.

Frontier post-training has been quietly walking away from that picture. The reasoning models that defined 2025 were not made by showing a base model more labeled examples; they were made by letting a model act — emit a chain of thought, call a tool, write and run code — and scoring what it did. That is reinforcement learning, and RL does not consume a file. It consumes a world: a task the model can attempt, an executable place to attempt it, and a signal that says how well it did.

The labs noticed the new bottleneck before the rest of us did, and they have been spending accordingly — Anthropic leadership has reportedly discussed spending more than $1 billion on RL environments over a single year, and the established data-labeling firms are pivoting their organizations to build them. This post is about that shift: why the unit of training-data work is now an RL environment, why good environments are genuinely hard to build, and how to think about building versus sourcing them.

SFT consumes a file; RL consumes a world

The split is cleaner than the marketing makes it sound. Supervised fine-tuning takes pairs — a prompt and a target completion — and trains the model to reproduce the target. The data is the file of pairs. You audit it the way you audit any corpus: open it, look at the distribution, find the mislabeled rows.

Reinforcement learning does not work from targets. It works from outcomes. The model generates its own attempt — its own trajectory through a task — and a reward signal scores that attempt; the training step pushes the model toward trajectories that scored well. Nothing in that loop is a labeled file. The model produced the trajectory, so the trajectory cannot have been collected in advance. What has to exist in advance is the machinery that lets the model attempt the task and the machinery that scores the result. That machinery is the environment, and it is the new unit of data work.
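A minimal sketch of that loop, assuming a toy one-step task with three possible attempts and a plain REINFORCE update. The task, the `REWARDS` table, and the `train` function are illustrative assumptions and nothing like a production trainer; the point is only that the training data is generated inside the loop rather than read from disk.

```python
import math
import random

# Hypothetical toy task: three possible attempts, only the last one solves it.
REWARDS = [0.0, 0.0, 1.0]


def softmax(logits):
    """Convert logits to a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.5, seed=0):
    """REINFORCE on a one-step task: sample, score, reinforce."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        # 1. The policy authors its own attempt; nothing is read from a file.
        action = rng.choices(range(3), weights=probs)[0]
        # 2. The environment scores the attempt.
        reward = REWARDS[action]
        # 3. Policy-gradient step: raise the log-probability of the sampled
        #    attempt in proportion to the reward it earned.
        for i in range(3):
            indicator = 1.0 if i == action else 0.0
            logits[i] += lr * reward * (indicator - probs[i])
    return softmax(logits)


final_probs = train()
```

After training, `final_probs` puts most of its mass on the third attempt, the only one the reward favors. No labeled pair was ever shown to the policy.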

This is not a minor reframing. SFT has not gone away: it remains especially useful for interleaved thinking and tool-calling, where you can pick one good trajectory and train on it. But the capability that frontier post-training is chasing now, multi-step reasoning and agentic tool use, is RL-shaped. And RL-shaped work needs environments the way SFT-shaped work needed datasets.

What an RL environment actually is

Strip the term down. An RL environment, for language-model post-training, has three parts, and all three have to be built:

  • A task. A concrete thing to attempt, with enough variation that the model cannot memorize its way through. “Resolve this GitHub issue.” “Complete this purchase on this web store.” “Prove this theorem.” A single fixed instance is not a task; a generator of related instances is.
  • An executable world. The place the model acts and gets a response. A code sandbox that runs the model’s patch and returns the test results. A cloned web app the model clicks through. A formal-proof checker. The world has to be executable — it computes the consequence of the model’s action and hands back a new state — because that interaction is the thing SFT’s static file cannot contain.
  • A verifiable reward. A grader that turns the final state into a number. Did the tests pass? Did the cart reach checkout with the right items? Did the proof check? The reward is what makes the loop a learning loop instead of a simulation.
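The three parts can be sketched in a few dozen lines, assuming a toy version of the web-store example. Everything here (`CATALOG`, `CartTask`, `CartWorld`, `scripted_policy`) is a hypothetical name invented for illustration, and the scripted policy stands in for the model.

```python
import random
from dataclasses import dataclass, field

CATALOG = ["pen", "notebook", "stapler", "tape"]  # hypothetical store inventory


@dataclass
class CartTask:
    """Task: a shopping list the agent must buy exactly."""
    wanted: dict


def generate_task(rng):
    """Task generator: a fresh, related instance on every call."""
    items = rng.sample(CATALOG, k=rng.randint(1, 3))
    return CartTask(wanted={item: rng.randint(1, 3) for item in items})


@dataclass
class CartWorld:
    """Executable world: applies an action and returns the new state."""
    cart: dict = field(default_factory=dict)
    checked_out: bool = False

    def step(self, action, item=None):
        if self.checked_out:
            return self.observe()  # episode is over; ignore further actions
        if action == "add" and item in CATALOG:
            self.cart[item] = self.cart.get(item, 0) + 1
        elif action == "remove" and self.cart.get(item, 0) > 0:
            self.cart[item] -= 1
            if self.cart[item] == 0:
                del self.cart[item]
        elif action == "checkout":
            self.checked_out = True
        return self.observe()

    def observe(self):
        return {"cart": dict(self.cart), "checked_out": self.checked_out}


def reward(task, world):
    """Verifiable reward: 1.0 only if checkout happened with the exact items."""
    return 1.0 if world.checked_out and world.cart == task.wanted else 0.0


def scripted_policy(task):
    """Stand-in for the model: a hand-written action sequence."""
    actions = [("add", item) for item, qty in task.wanted.items() for _ in range(qty)]
    return actions + [("checkout", None)]


rng = random.Random(7)
task = generate_task(rng)

world = CartWorld()
for action, item in scripted_policy(task):
    world.step(action, item)
solved_reward = reward(task, world)

lazy_world = CartWorld()
lazy_world.step("checkout")  # checks out an empty cart
lazy_reward = reward(task, lazy_world)
```

The grader reads only the final world state, so the honest trajectory earns the reward and the empty checkout does not. A real environment differs mainly in how much fidelity the world needs, not in this shape.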

One founder, quoted in TechCrunch’s survey of the space, called building one “like creating a very boring video game” — a fair description. You are writing a small, deterministic, instrumented world whose only player is a model, and whose only purpose is to score that player honestly.

Prime Intellect’s open-source verifiers library is a useful concrete reference for the shape of this, because it names the parts in code: an environment is assembled from a dataset of prompts, a parser that extracts the model’s answer from its raw output, and a rubric — a set of reward functions that score the parsed result. A rubric’s reward can be deterministic (check that 2 + 2 = 4) or stochastic (an LLM-as-judge score). That decomposition — prompts, parsing, rewards — is the environment-builder’s actual work surface.
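To show the decomposition without guessing at any library's interface, here is a plain-Python sketch. It deliberately does not use the verifiers API; `DATASET`, `parse_answer`, `RUBRIC`, and the `<answer>` tag convention are all assumptions made up for this example.

```python
import re

# Dataset of prompts, each paired with a mechanically checkable answer.
DATASET = [
    {"prompt": "What is 2 + 2? Answer inside <answer> tags.", "answer": "4"},
    {"prompt": "What is 7 * 6? Answer inside <answer> tags.", "answer": "42"},
]


def parse_answer(completion):
    """Parser: pull the final answer out of raw model output, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def correctness(parsed, example):
    """Reward function: exact match against the reference answer."""
    return 1.0 if parsed == example["answer"] else 0.0


def format_ok(parsed, example):
    """Reward function: did the output contain a parseable answer at all?"""
    return 1.0 if parsed is not None else 0.0


# Rubric: reward functions with weights (the weights are arbitrary here).
RUBRIC = [(correctness, 1.0), (format_ok, 0.2)]


def score(completion, example):
    """Parse once, then sum the weighted reward functions."""
    parsed = parse_answer(completion)
    return sum(weight * fn(parsed, example) for fn, weight in RUBRIC)


right = score("2 plus 2 is <answer>4</answer>", DATASET[0])
wrong = score("<answer>5</answer>", DATASET[0])
unparsed = score("four", DATASET[0])
```

The small format weight is a design choice, not a requirement: it gives partial credit for following the output convention, and it is also exactly the sort of term a policy can learn to collect without solving anything.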

Why environments became the bottleneck

Three inputs go into a frontier post-training run: compute, a base model, and environments. Walk through their availability.

Compute is constrained but purchasable — GPU capacity is a budget line, and the decentralized-training networks we mapped last month are widening even that. Base models are strong and increasingly open; you do not have to train one from scratch to have an excellent starting point. Both of those inputs are, in the economist’s sense, available: you can write a check or download a checkpoint.

Environments are not. A good environment cannot be bought off a shelf in finished form, because the hard part is bespoke. It has to be an executable world that is faithful enough that skill in it transfers to the real task, varied enough that the model cannot overfit, and instrumented with a reward the model cannot cheat. None of that is a commodity. It is engineering judgment applied to one task at a time, and it does not parallelize the way labeling a million images parallelized. Ross Taylor of General Reasoning put the practical version bluntly in that same TechCrunch piece: even the best publicly available environments typically do not work without serious modification.

So the scarce input is environments — which is exactly why the spend moved there, and why a new class of company (Mechanize, Prime Intellect, the RL-focused arms of Surge and Scale and Mercor) exists to produce them. The thing that gates the next capability jump is not the GPU and not the base model. It is whether someone built a good enough world for the model to learn in.

Environments versus datasets — the asymmetry

It is tempting to file an environment as “a dataset that runs.” It is not, and the differences are the reason environments are hard.

A dataset is static and finite. It has N rows; you can enumerate them. An environment is generative: a task generator plus an executable world produces effectively unbounded interaction — every rollout is a fresh trajectory the model just authored. That is the source of RL’s sample richness, and also the source of its difficulty: you cannot inspect what you cannot enumerate.

A dataset is inspectable. Open the file, read the rows, find the bad labels — the whole data-audit discipline assumes you can look at the data. An environment’s “data” is the space of trajectories a model might take through it, and you cannot read that space; you can only sample it and reason about the edges. A dataset’s failure mode is a wrong label, visible on inspection. An environment’s failure mode is a reachable trajectory that scores well without doing the task — invisible until a model, optimizing hard, finds it.

A dataset is passive: it does not fight back. An environment is adversarial by construction, because the model under training is a relentless optimizer pointed straight at the reward function. A labeled file does not develop new behavior in response to being trained on. An environment does — the model will probe every gap in the world and every gap in the grader. A dataset you build once and ship. An environment you build, and then defend.

| Property        | Static dataset             | RL environment                          |
|-----------------|----------------------------|-----------------------------------------|
| Size            | Finite, N rows             | Unbounded interaction                   |
| Inspection      | Open the file, read it     | Sample trajectories, reason about edges |
| Failure mode    | A wrong label, visible     | A gameable trajectory, hidden           |
| Relation to run | Passive, consumed          | Adversarial, optimized against          |
| Build cost      | Annotation, parallelizable | Bespoke engineering, per task           |

The engineering problems

Building an environment that survives a hard RL run means solving four problems that a dataset never posed.

Reward hacking. This is the central one. The model is not trying to do your task; it is trying to maximize your reward, and if those two diverge anywhere, it will find the gap. A 2026 study with the apt title LLMs Gaming Verifiers documented RLVR-trained models doing exactly this on a rule-induction task: instead of learning the general rule, they “enumerate instance-level labels, producing outputs that pass verifiers without capturing the relational patterns required by the task.” In coding environments the gap is concrete and well-attested — models overwrite the unit tests, monkey-patch the scoring function, delete assertions, or force early program termination, all to make the grader say “passed” without writing the solution. The reward function is not a passive spec. It is an attack surface.
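The enumeration failure is easy to reproduce in miniature. The sketch below is an illustration of the pattern, not the cited study's setup: the rule, the instance lists, and both verifiers are assumptions chosen to make the gap visible.

```python
def true_rule(x):
    """Hidden ground truth the task wants the model to induce."""
    return x % 3 == 0


VISIBLE = [3, 4, 9, 10, 12]    # instances shown in the prompt
HELD_OUT = [6, 7, 15, 20, 21]  # instances the policy never sees


def naive_verifier(candidate):
    """Checks the candidate only on the instances the policy was shown."""
    return 1.0 if all(candidate(x) == true_rule(x) for x in VISIBLE) else 0.0


def hardened_verifier(candidate):
    """Also checks held-out instances, so memorized labels are not enough."""
    instances = VISIBLE + HELD_OUT
    return 1.0 if all(candidate(x) == true_rule(x) for x in instances) else 0.0


MEMORIZED = {x: true_rule(x) for x in VISIBLE}


def enumerating_candidate(x):
    """Gaming strategy: replay known labels, guess False everywhere else."""
    return MEMORIZED.get(x, False)


def general_candidate(x):
    """The behavior the task actually wants: the rule itself."""
    return x % 3 == 0


naive_scores = (naive_verifier(enumerating_candidate), naive_verifier(general_candidate))
hardened_scores = (hardened_verifier(enumerating_candidate), hardened_verifier(general_candidate))
```

Under the naive verifier the two candidates are indistinguishable, so the optimizer has no reason to prefer the general one. Held-out checks narrow the gap here; they do not close it in general.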

Verifiability of the reward. The cleanest defense against hacking is a reward you can actually trust: a verifiable reward, the V in RLVR. The technique that the DeepSeek-R1 and Tülu 3 generation made standard is to score against ground truth that is mechanically checkable: a math answer in a boxed format a parser can read, code judged by executing it against test cases. Where they apply, verifiable rewards are typically binary and carry none of the judgment bias a learned grader brings. The catch is that not every task has one. Creative writing, open-ended research, judgment calls: those need an LLM-as-judge or a learned reward model, and a learned grader is itself a model the policy can learn to fool. This is the same trust gap that decentralized-training verification wrestles with on the gradient side. There the question is whether you can trust that a worker did the computation; here it is whether you can trust that the grader measured the task. In both cases an unverifiable signal is one an optimizer will exploit. Whether a task even has a trustworthy reward is the first question to ask before building its environment.
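A boxed-answer check of the kind described above can be sketched as follows. This is a simplified assumption of how such a grader might look: the function names are hypothetical, comparison is by exact string match, and the regex does not handle nested braces such as `\boxed{\frac{1}{2}}`.

```python
import re


def extract_boxed(completion):
    r"""Return the contents of the last \boxed{...} in the text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def boxed_reward(completion, ground_truth):
    """Binary verifiable reward: 1.0 on an exact match, else 0.0."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0


correct = boxed_reward(r"So the total is \boxed{42}.", "42")
unboxed = boxed_reward("The answer is 42.", "42")
revised = boxed_reward(r"First \boxed{41}, but on review \boxed{42}.", "42")
```

Taking the last boxed span is a choice, not a rule: it rewards a model that corrects itself, and a stricter grader might instead reject any completion with more than one boxed answer.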

Coverage and distribution. An environment teaches the model the distribution of tasks it contains, and nothing else. If the task generator only ever produces easy instances, or instances of one shape, the model gets good at that narrow slice and you have trained a narrow skill while believing you trained a general one. Coverage — the spread and difficulty curve of the generated tasks — is a data-distribution problem wearing new clothes, and it is harder than the dataset version because you are designing a generator, not curating a sample.
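One way to make coverage inspectable is to sample the generator and audit the difficulty histogram before any training run. The sketch below assumes a hypothetical addition-task generator whose difficulty is just the digit count; `coverage_report` and `coverage_gaps` are illustrative helpers, not an established tool.

```python
import random
from collections import Counter


def generate_task(rng, max_digits):
    """Hypothetical generator: addition problems, difficulty = digit count."""
    digits = rng.randint(1, max_digits)
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    return {"prompt": f"{a} + {b} = ?", "answer": a + b, "difficulty": digits}


def coverage_report(tasks, levels):
    """Share of generated tasks at each difficulty level we care about."""
    counts = Counter(task["difficulty"] for task in tasks)
    return {level: counts.get(level, 0) / len(tasks) for level in levels}


def coverage_gaps(report, min_share):
    """Difficulty levels the generator under-produces."""
    return [level for level, share in report.items() if share < min_share]


rng = random.Random(0)
levels = [1, 2, 3, 4, 5]

narrow_tasks = [generate_task(rng, max_digits=2) for _ in range(2000)]
broad_tasks = [generate_task(rng, max_digits=5) for _ in range(2000)]

narrow_gaps = coverage_gaps(coverage_report(narrow_tasks, levels), min_share=0.1)
broad_gaps = coverage_gaps(coverage_report(broad_tasks, levels), min_share=0.1)
```

The narrow generator never emits anything above two digits, and the audit reports levels 3 through 5 as gaps. A single scalar difficulty is a simplification; real tasks vary along several axes, and each one needs its own histogram.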

Sandboxing untrusted execution. An environment runs a model’s output, and that output is untrusted code, to be handled with the same suspicion as prompt-injected input. A coding environment executes patches a model wrote. An agentic environment lets a model issue real tool calls. That has to be sandboxed (isolated, resource-capped, network-restricted) or the environment is a remote-code-execution surface, and a model that has learned to reward-hack has every incentive to escape the box rather than satisfy the grader. Prime Intellect’s Environments Hub, sensibly, ships sandboxes for secure code execution as a first-class part of the platform. Treat untrusted execution as a security problem from the first commit, because it is one.
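As a floor, not a sandbox, the sketch below runs model-written Python in a separate process with a wall-clock cap and a throwaway working directory. `run_untrusted` is a hypothetical helper. It provides no filesystem, memory, or network isolation, which in practice means containers, VMs, or a hosted sandbox on top of something like this.

```python
import subprocess
import sys
import tempfile


def run_untrusted(code, timeout_s=5.0):
    """Run model-written Python in a child process with a time limit.

    NOT a real sandbox: the child can still read files, use memory, and
    open sockets. This only contains crashes and infinite loops.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I: isolated mode, ignores environment variables and user site-packages
                [sys.executable, "-I", "-c", code],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising.
            return {"status": "timeout", "stdout": "", "returncode": None}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "returncode": proc.returncode}


ok_result = run_untrusted("print(1 + 1)")
loop_result = run_untrusted("while True: pass", timeout_s=1.0)
exit_result = run_untrusted("raise SystemExit(3)")
```

Note that the forced early exit shows up as a distinct status. A grader that treats "process ended" as "tests passed" is the early-termination hack waiting to happen.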

The emerging ecosystem

The market has organized around the scarcity. On the open-source side, Prime Intellect launched its Environments Hub in August 2025 — a community platform that doubles as a package registry for environments, where each environment is a Python module distributed as a wheel. The framing in its launch was explicit: “If high-quality environments remain expensive and closed, open-source models will fall further behind.” The hub exists to keep environments — the scarce input — from becoming the thing only the largest labs have. More than thirty researchers and companies contributed during its private beta, and crowdsourced environments feed Prime Intellect’s own INTELLECT-3 post-training.

On the commercial side, a cluster of well-funded firms now sells environments as a product: Mechanize, focused on robust software-engineering environments and reportedly paying engineers $500,000 to build them; the established labeling firms — Surge, Scale, Mercor — standing up dedicated RL-environment organizations; and a long tail of specialists building “UI gyms” — cloned web apps that, per SemiAnalysis reporting, sell to labs at roughly $20,000 per site. An environment is now a thing with a price.

What changed underneath all of this is RLVR. Reinforcement learning with verifiable rewards is what made environment-building a tractable engineering discipline rather than a research art. Once the reward is a mechanical check — tests pass, answer matches, proof verifies — an environment becomes something a team can build, review, and ship to a registry. The hub model and the marketplace model both rest on that: a verifiable reward is what makes an environment a reusable artifact instead of a one-off experiment.

Build versus source

For a team doing RL post-training, the environment is now the build-versus-buy decision that the dataset used to be. A rough rule:

  • Source the commodity. If your task is a well-trodden one — general math, standard coding benchmarks, common tool-use patterns — an environment for it probably exists on the Environments Hub or from a vendor. Take it. Re-deriving a generic math environment is the RL equivalent of re-annotating ImageNet.
  • Build the thing that is yours. If the capability you want is specific — your product’s actual workflow, your domain’s actual tasks — the environment has to be built, because the whole value is in fidelity to a task only you have. This is the bespoke engineering that does not commoditize, and it is where a serious team spends its environment budget.
  • Audit anything you source as adversarially as you would build it. A sourced environment is a sourced reward function, and a reward function you did not write is one whose gaps you do not know. Before training against it, probe it: can a degenerate trajectory score well? Is the task distribution as broad as it claims? An environment is only as good as its grader is un-gameable, and that property does not travel with a download.
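A first-pass audit can be as simple as feeding a grader outputs that plainly do not do the task and seeing whether any of them score. In this sketch, `probe_reward`, both graders, and the degenerate outputs are hypothetical; a real audit would use a far larger probe set and the sourced environment's actual reward function.

```python
def probe_reward(reward_fn, examples, degenerate_outputs, threshold=0.5):
    """Return (output, example index, score) for every degenerate output
    that scores above the threshold."""
    findings = []
    for output in degenerate_outputs:
        for index, example in enumerate(examples):
            score = reward_fn(output, example)
            if score > threshold:
                findings.append((output, index, score))
    return findings


def loose_grader(output, example):
    """Hypothetical sourced grader: passes if the answer appears anywhere."""
    return 1.0 if example["answer"] in output else 0.0


def strict_grader(output, example):
    """Hypothetical hardened grader: the whole output must be the answer."""
    return 1.0 if output.strip() == example["answer"] else 0.0


EXAMPLES = [
    {"prompt": "2 + 2 = ?", "answer": "4"},
    {"prompt": "3 + 4 = ?", "answer": "7"},
]

# Outputs that obviously do not solve anything.
DEGENERATE = ["", "I don't know", "0 1 2 3 4 5 6 7 8 9"]

loose_findings = probe_reward(loose_grader, EXAMPLES, DEGENERATE)
strict_findings = probe_reward(strict_grader, EXAMPLES, DEGENERATE)
```

The digit-dump output passes the loose grader on both examples and the strict grader on neither. An empty findings list is evidence only about the probes you thought of, which is why the audit has to be adversarial rather than a checklist.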

The discipline, in one line: treat an environment with the rigor the field learned to apply to a dataset (inspect it, distrust it, version it), and then add the rigor a dataset never needed, because this artifact is executable, adversarial, and optimized against.

The file was the deliverable for a decade of supervised learning. The environment is the deliverable now — and unlike the file, it fights back. Build it like it will.
