Human-in-the-loop checkpoints without killing throughput.

An agent is configured to ask a human before every tool call. It is a safe-sounding default, and it survives contact with production for about a week. The reviewer is now approving two hundred actions a day, almost all of them — read this file, search this index, fetch this record — entirely benign. The first morning, the reviewer reads each one. By the second, the eye has learned the shape of a safe action and skips ahead to the approve button; by the end of the week the two hundred clicks take eleven minutes and involve no reading at all. Two hundred approvals a day is not a review. It is a queue, and the human has been quietly reassigned from reviewer to throughput valve.

Then the one action that mattered that week arrives, a write that touches live customer data, and it goes through in the same reflexive click as the reads around it. The gate was there. It fired. A human looked at it for the third of a second it took to register as one more row in the queue, and approved it, because the queue had trained that response over two hundred repetitions a day for five days. The checkpoint did not fail because it was missing. It failed because it was everywhere, and a control that is everywhere is a control that is nowhere.

A useful checkpoint is not “ask a human more.” It is asking a human at the few places where a human’s judgment actually changes the outcome — and this essay is about finding those places, on two axes, and wiring the agent so the hard cases actually reach the gate.

How a checkpoint fails in both directions

A human-in-the-loop checkpoint fails in two opposite directions, and naming both is the start of placing checkpoints well, because the two failures pull the same dial in opposite ways: a team that overcorrects for one walks straight into the other.

It fails by gating too much. Every action behind an approval makes the human a bottleneck on the agent’s throughput: the agent stalls at the gate, idle, while a person context-switches over to a queue. That stall is not free. A paused agent is still holding its slot, its context, sometimes its rented GPU time, and the cost of waiting is real spend, which is why the agent-budgets essay treats latency as a budgeted resource alongside tokens. Worse than the cost, gating everything trains the reviewer. A person asked to approve two hundred near-identical benign actions a day stops evaluating them; the approval becomes a motor reflex, and a reflexive approval is not oversight, it is a rubber stamp with a person’s name on it. The dangerous action is then hidden in plain sight, surrounded by a hundred and ninety-nine benign ones that taught the reviewer not to look.

It fails by gating the wrong thing. A checkpoint on a step that is cheap and reversible — a read, a search — while the step that is expensive and irreversible — a write, a payment, an outbound message — runs unattended is not a weak gate. It is a gate on the wrong door, and a gate on the wrong door provides something more dangerous than no gate at all: the documented appearance of oversight. The post-incident review finds an approval log, sees a human in the loop, and looks elsewhere for the cause.

Both failures come from one mistake: treating “human in the loop” as a quantity — more approvals, more safety — rather than as a placement problem. The amount of review is not the lever. Where it sits is, and “where” answers to two questions: how much harm an action can do, and how sure the model is.

Gate on consequence, not on step type

The first axis for placing a gate is consequence: how much harm the action can do and how reversible it is. A read of a record can be wrong and the cost is a wasted call; a destructive write can be wrong and the cost is unrecoverable. The gate belongs on the second regardless of how routine the call looks in the trace — and routine is exactly how the dangerous action will look, because the agent issues it with the same calm formatting it uses for a search.

LangChain’s human-in-the-loop design makes the mechanics concrete. A checked tool call surfaces one of four human decisions: approve the call as proposed, edit the arguments before it runs, reject it outright, or respond with natural-language guidance that sends the agent back to plan again. The design guidance is explicit that high-risk operations should be gated and safe ones should bypass approval entirely. The placement rule that follows:

Gate writes to systems of record, payments and fund movement, outbound communication to real people, deletions, and anything that crosses a trust boundary you cannot pull back across. The test is reversibility: if undoing the action requires more than repointing a pointer, gate it.
Do not gate reads, searches, retrievals, and internal computation. They are reversible, they are most of the volume, and gating them is precisely the noise that breaks the reviewer. A gate here costs throughput and buys nothing.
Gate the irreversible even when it is rare. Frequency is not consequence. A destructive action that fires once a week still belongs behind a gate; in fact the reviewer who sees that action once a week, against a backdrop of nothing else to approve, is the one reviewer in the system still actually paying attention when it appears.

Consider an agent that drafts and sends customer refund emails. Drafting is reversible — a bad draft is discarded — so it runs free. Looking up the customer’s order history is a read, and it runs free even though it happens forty times more often than a send. Sending is neither reversible nor cheap, and it gets the gate. The volume sits where it is harmless; the gate sits on the one step that reaches a real person’s inbox and a real account balance. This is the same instinct behind gating tool calls by what they can touch rather than how often they fire — the tool-design essay argues a tool’s schema should make a dangerous call hard to express in the first place, and a consequence gate is the runtime backstop for the calls schema design cannot make safe on its own.
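The placement rule and the refund example above can be sketched as a small lookup. This is a minimal illustration under stated assumptions, not LangChain's API: the tool names (`lookup_order_history`, `draft_refund_email`, `send_refund_email`), the `Consequence` enum, and the choice to send unclassified tools to the gate are all hypothetical.

```python
from enum import Enum


class Consequence(Enum):
    REVERSIBLE = "reversible"      # reads, searches, drafts, internal compute
    IRREVERSIBLE = "irreversible"  # writes, payments, sends, deletions


# Hypothetical tool registry for the refund-email agent described above.
TOOL_CONSEQUENCE = {
    "lookup_order_history": Consequence.REVERSIBLE,
    "draft_refund_email": Consequence.REVERSIBLE,
    "send_refund_email": Consequence.IRREVERSIBLE,
}


def needs_human_gate(tool_name: str) -> bool:
    """Gate on what the tool can touch, never on how often it is called."""
    # Assumption: a tool nobody has classified is treated as irreversible.
    consequence = TOOL_CONSEQUENCE.get(tool_name, Consequence.IRREVERSIBLE)
    return consequence is Consequence.IRREVERSIBLE


if __name__ == "__main__":
    for tool in TOOL_CONSEQUENCE:
        decision = "HUMAN GATE" if needs_human_gate(tool) else "run, no gate"
        print(f"{tool}: {decision}")
```

Note that call frequency appears nowhere in the lookup: the forty-to-one ratio of reads to sends has no bearing on which call is gated.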

One caveat the LangChain guidance itself raises: editing an agent’s tool arguments too aggressively at the gate can make the model re-plan and re-execute around the edit, producing a different action than either the agent or the reviewer intended. The reviewer’s job at a consequence gate is to approve or reject. Co-authoring the agent’s arguments is a different and more fragile interaction.

Gate on the model’s own uncertainty

Consequence is one axis. The second is the model’s confidence — and it is the one teams almost never instrument, because it requires a number the agent does not volunteer and will not produce honestly if simply asked “are you sure.”

The KnowNo work (arXiv 2307.01928) is the clearest demonstration of doing it properly. It uses conformal prediction — a statistical procedure that converts a model’s raw scores into prediction sets with a calibrated coverage guarantee — to give a planner a trustworthy measure of its own uncertainty, and triggers human help only when that measure says the model is genuinely unsure. The guarantee is the load-bearing part: the gate fires on a quantity with a known relationship to task success, not on a vibe. The payoff is the throughput argument made precise. Calibrated gating reduced human help by 10–24% against baselines that lacked the calibration, while holding task success constant — fewer interruptions, same safety. And the failure case is just as instructive: a mis-calibrated, conservative policy in the same study needed a human on 87% of steps to reach the same success target. Same destination, wildly different toll, and the entire difference is whether the gate reads a real uncertainty signal or a crude proxy.

The crude proxy worth naming is a raw token probability. A model can be fluent and wrong — high next-token confidence over a plausible, incorrect plan — so the bare probability is not calibrated to whether the action is right. Conformal calibration is the step that turns the model’s scores into something a gate can trust. A gate triggered by step type fires on the wrong steps; a gate triggered by raw probability fires confidently on the wrong steps; a gate triggered by a calibrated “I don’t know” fires on the right ones.
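To make the difference between a raw probability and a calibrated signal concrete, here is a simplified sketch of split conformal prediction over multiple-choice plan options. It is not the KnowNo implementation; the function names are hypothetical, and the numbers in the usage section are made up for illustration rather than taken from the paper. It assumes a held-out calibration set where the probability the model assigned to the correct option is known.

```python
import math


def conformal_threshold(true_option_probs, epsilon):
    """Calibrate q_hat from the probabilities the model gave the CORRECT option.

    Under exchangeability, prediction sets built with q_hat contain the
    correct option with probability at least 1 - epsilon.
    """
    if not true_option_probs or not 0 < epsilon < 1:
        raise ValueError("need calibration data and 0 < epsilon < 1")
    n = len(true_option_probs)
    scores = sorted(1.0 - p for p in true_option_probs)  # nonconformity scores
    rank = math.ceil((n + 1) * (1 - epsilon))            # 1-indexed quantile rank
    if rank > n:
        return 1.0  # too little calibration data: every option stays in the set
    return scores[rank - 1]


def prediction_set(option_probs, q_hat):
    """Keep every option whose nonconformity score is within the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= q_hat]


def needs_human(option_probs, q_hat):
    """Run only when exactly one option survives; otherwise ask."""
    return len(prediction_set(option_probs, q_hat)) != 1


if __name__ == "__main__":
    # Illustrative calibration data: probability assigned to the right option.
    calibration = [0.9, 0.6, 0.5, 0.3, 0.8, 0.4, 0.7]
    q_hat = conformal_threshold(calibration, epsilon=0.25)
    clear = {"A": 0.85, "B": 0.10, "C": 0.05}
    murky = {"A": 0.48, "B": 0.45, "C": 0.07}
    print(prediction_set(clear, q_hat), needs_human(clear, q_hat))
    print(prediction_set(murky, q_hat), needs_human(murky, q_hat))
```

The gate never reads the top probability directly. It reads the size of the set that survives a threshold fitted on outcomes, which is what ties the trigger to task success.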

This axis is also where human review genuinely earns its latency cost. The scalable-oversight work (arXiv 2211.03540) found that a human interacting with an unreliable model assistant outperformed both the model alone and the unaided human. Review adds the most value precisely where neither party is reliable on its own — and a calibrated uncertainty signal is the instrument purpose-built to locate exactly those moments, instead of spreading the human’s attention evenly across steps most of which needed none of it.

The agent will not ask — it will guess

There is a load-bearing assumption hidden inside the phrase “human in the loop”: that the agent will surface the hard cases, that when it hits something genuinely ambiguous it will stop and put the question to the human. It will not. Left to itself, an agent facing missing or ambiguous information does not pause and escalate. It guesses, fluently, and proceeds as if the guess were given.

The “Learning to Ask” work (arXiv 2409.00557) traces this to the training objective: faced with an underspecified instruction, an agent will “arbitrarily generate the missed argument” rather than flag the gap. Nothing in next-token prediction rewards stopping to ask; everything rewards a confident continuation. And the measured size of the problem is stark. The 2026 HiL-Bench benchmark (Scale AI, arXiv 2604.09408, a preprint) found frontier coding agents scoring 75–89% when handed complete information, but only 4–24% when they had to detect the ambiguity themselves and decide whether to ask — even with an ask-for-help tool sitting available. The collapse is not a competence failure on the task. It is a failure to notice that escalation was the correct move, and an unused escalation tool is not a safety mechanism.
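One way a harness might catch a generated argument before it runs is to check where each required argument came from. The sketch below is an assumption-heavy illustration, not a method from either paper: `ToolCall`, `grounded_args`, and `precheck` are hypothetical names, and how the harness traces an argument back to the user's request or a prior tool result is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict
    # Argument names the harness could trace to the user's request or to an
    # earlier tool result. Provenance tracking itself is outside this sketch.
    grounded_args: set = field(default_factory=set)


def ungrounded_required_args(call: ToolCall, required: set) -> set:
    """Required arguments that are missing or that the agent filled in itself."""
    return {
        arg for arg in required
        if arg not in call.args or arg not in call.grounded_args
    }


def precheck(call: ToolCall, required: set):
    """Decide before execution; the agent is never asked whether it is sure."""
    gaps = ungrounded_required_args(call, required)
    if gaps:
        return "escalate", sorted(gaps)
    return "run", []


if __name__ == "__main__":
    call = ToolCall(
        name="send_refund_email",
        args={"order_id": "A-1", "amount": 40},
        grounded_args={"order_id"},  # the amount was never given by anyone
    )
    print(precheck(call, required={"order_id", "amount"}))
```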

Two consequences follow for checkpoint design. First, the trigger to escalate has to be built into the harness — a calibrated uncertainty gate, an explicit ambiguity check before execution — and never left to the agent’s discretion, because the discretion is the thing that does not work. Second, the opposite failure is just as real. HiL-Bench’s scoring deliberately punishes “question spam”: an agent that, once given an ask tool, asks constantly to cover itself, pushing every marginal decision onto the human and recreating the rubber-stamp queue from the other end. The target is an agent that asks at the right moments and only those — a property you specify and measure against a benchmark, not a disposition you can prompt in and trust.
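Measuring "asks at the right moments and only those" needs a score that punishes both failure modes. A minimal sketch using precision and recall over escalations follows. This is not HiL-Bench's actual scoring, whose details are not reproduced here; `ask_scores` and the labelled episodes are hypothetical.

```python
def ask_scores(asked, needed):
    """Score escalation behaviour over labelled episodes.

    asked[i]  -- True if the agent escalated on episode i
    needed[i] -- True if escalation was the correct move on episode i
    Returns (precision, recall, f1). Question spam lowers precision,
    missed escalation lowers recall, and F1 penalises both.
    """
    if len(asked) != len(needed):
        raise ValueError("asked and needed must be the same length")
    justified = sum(1 for a, n in zip(asked, needed) if a and n)
    total_asked = sum(1 for a in asked if a)
    total_needed = sum(1 for n in needed if n)
    # Convention in this sketch: an empty denominator scores 0.0.
    precision = justified / total_asked if total_asked else 0.0
    recall = justified / total_needed if total_needed else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    needed = [True, False, False, False]
    print("spammer:", ask_scores([True, True, True, True], needed))
    print("silent: ", ask_scores([False, False, False, False], needed))
    print("target: ", ask_scores([True, False, False, False], needed))
```

An agent that asks on every episode gets perfect recall and still scores poorly, which is the property the rubber-stamp queue lacks.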

Autonomy is a level you set, per workflow

Step back from individual gates and the real decision is structural: how much autonomy a given workflow is granted at all. That word — autonomy — is doing precise work here, and it is worth being deliberate about it. The “agentic means nothing” essay argues that “agentic” has collapsed into uselessness because it conflates a lone tool call with a fully autonomous loop, and that the fix is to name the axis you actually mean. Autonomy is that axis. A checkpoint design is, precisely, a decision about where on the autonomy axis a workflow sits — so it should be set explicitly, as a number, not left implicit in a scatter of individual gate placements.

The Cloud Security Alliance’s 2026 autonomy framework gives that number a scale, borrowing the structure of vehicle-automation levels: from Level 1, where every action needs explicit approval, through Level 3, where the agent runs autonomously inside defined boundaries and escalates only when a situation falls outside them, up to higher levels where humans merely monitor. Level 3 is the one the placement rules above are building toward — “escalate on boundary exceptions” is the entire thesis of this essay compressed into four words.

The point of the framework is that the level is a deliberate, per-workflow choice, not a global setting and not an accident of which gates someone happened to add. A read-only research agent can sit at a high level with almost no gates; the worst it does is waste a query. An agent that moves money sits low, with a consequence gate on every transfer. Running both at the same level is the mistake — whether the shared level is too high for the money-mover or too low for the researcher.
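Setting the level explicitly, per workflow, might look like the configuration sketch below. It models only the two levels described above and omits the framework's other levels rather than guessing at them. The workflow names and the level assigned to each are placeholders, not values taken from the Cloud Security Alliance document.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    # Only the levels the text describes; the rest are intentionally omitted.
    APPROVE_EVERY_ACTION = 1      # every action needs explicit approval
    AUTONOMOUS_WITHIN_BOUNDS = 3  # escalate only on boundary exceptions


# Hypothetical per-workflow assignment. There is no global default.
WORKFLOW_LEVELS = {
    "readonly_research": AutonomyLevel.AUTONOMOUS_WITHIN_BOUNDS,
    "funds_transfer": AutonomyLevel.APPROVE_EVERY_ACTION,
}


def requires_approval(workflow: str, within_bounds: bool) -> bool:
    """Look up the workflow's level and apply it to one proposed action."""
    # A KeyError here is deliberate: an unassigned workflow should not run
    # at a level nobody chose.
    level = WORKFLOW_LEVELS[workflow]
    if level is AutonomyLevel.APPROVE_EVERY_ACTION:
        return True
    return not within_bounds


if __name__ == "__main__":
    print(requires_approval("funds_transfer", within_bounds=True))
    print(requires_approval("readonly_research", within_bounds=True))
    print(requires_approval("readonly_research", within_bounds=False))
```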

Escalation can also be layered rather than binary. The cascaded decision-making work (arXiv 2506.11887) shows a structure where a cheap model handles the routine case, defers to a stronger model when unsure, and only then to a human — the expensive reviewer reached last and least. Human review, in a well-built system, is the top rung of an escalation ladder, not the default rung and not a rung every action has to climb. This is the same discipline the multi-agent essay applies to orchestration: add a coordinating tier only where a cheaper tier provably cannot carry the case, because every tier you add is latency, cost, and surface you now own.
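The ladder can be sketched as a loop over tiers that ends at the human. This is an illustration of the shape only, not the method from the cascaded decision-making paper: `cascade` and the tier callables are hypothetical, and each tier's confidence flag is assumed to come from a calibrated signal rather than from the model's self-report.

```python
from typing import Callable, List, Tuple

# A tier takes a task and returns (answer, is_confident).
Tier = Callable[[str], Tuple[str, bool]]


def cascade(
    task: str,
    tiers: List[Tuple[str, Tier]],
    ask_human: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each tier in order, cheapest first; the human is the last rung."""
    for name, tier in tiers:
        answer, confident = tier(task)
        if confident:
            return answer, name
    return ask_human(task), "human"


if __name__ == "__main__":
    # Stub tiers standing in for a cheap model and a stronger one.
    def cheap(task):
        return "refund approved", task == "routine refund"

    def strong(task):
        return "refund approved with note", task != "disputed chargeback"

    def human(task):
        return "escalated to finance"

    tiers = [("cheap", cheap), ("strong", strong)]
    for task in ("routine refund", "partial refund", "disputed chargeback"):
        print(task, "->", cascade(task, tiers, human))
```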

Where the gates actually go

  agent action
       │
       ▼
  ┌─ consequence?  ── low (read, search, compute) ──► run, no gate
  │                ── high (write, pay, send, delete) ──► HUMAN GATE
  │
  └─ confidence?   ── calibrated-confident ──────────► run, no gate
                   ── calibrated-uncertain ──────────► HUMAN GATE

An action runs unattended only when it is reversible and the model is calibrated-confident about it; anything irreversible, or anything the model is genuinely unsure of, takes the gate. Before an agent with real-world side effects goes to production:

Every irreversible or trust-boundary-crossing action is gated; reads and internal steps are not.
The gate fires on a calibrated uncertainty signal, not on step type or a raw token probability.
An explicit ambiguity or escalation trigger is built into the harness — the agent is not relied on to volunteer the hard cases.
The escalation policy is scored so that question spam is penalised as heavily as a missed escalation.
Each workflow is assigned an autonomy level deliberately, and review is the last tier of a ladder, not the default rung.
The reviewer’s daily gate volume is low enough that each approval is still a decision, not a reflex.
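The two axes combine as an OR on the gate, which the routing sketch below makes explicit. The `ProposedAction` fields are hypothetical. In a real harness `reversible` would come from a tool classification and `calibrated_confident` from a calibrated uncertainty measure, both assumed here as given inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    reversible: bool            # consequence axis
    calibrated_confident: bool  # confidence axis


def route(action: ProposedAction) -> str:
    """Run unattended only when BOTH checks pass; either failure gates."""
    if not action.reversible:
        return "HUMAN GATE"  # irreversible: gate even if the model is sure
    if not action.calibrated_confident:
        return "HUMAN GATE"  # unsure: gate even if the step is cheap
    return "run, no gate"


if __name__ == "__main__":
    for action in (
        ProposedAction("search_index", reversible=True, calibrated_confident=True),
        ProposedAction("search_index", reversible=True, calibrated_confident=False),
        ProposedAction("send_email", reversible=False, calibrated_confident=True),
    ):
        print(action.tool, "->", route(action))
```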

Reading list

KnowNo — calibrated uncertainty as the gate trigger; 10–24% less human help at equal success: arXiv 2307.01928
Learning to Ask — agents generate the missing argument rather than asking for it: arXiv 2409.00557
HiL-Bench — frontier agents at 75–89% with full information, 4–24% when they must self-detect ambiguity (preprint): arXiv 2604.09408
Measuring Progress on Scalable Oversight — human-plus-unreliable-model beats either alone: arXiv 2211.03540
Cascaded Language Models for Cost-effective Human-AI Decision-Making — layered escalation, human as the last tier: arXiv 2506.11887
LangChain — human-in-the-loop: the approve / edit / reject / respond decision and the gate-the-risky rule: docs.langchain.com
Cloud Security Alliance — levels of autonomy for agentic AI, and “escalate on boundary exceptions”: cloudsecurityalliance.org

Put a gate on every step and you have not built oversight — you have built a queue, and a queue’s only lesson to the person working it is that the next item looks like the last one. Gate consequence. Gate uncertainty. Leave the reviewer few enough decisions that they are still awake for the one that counts.
