Notes on agent budgets: why "let it think longer" is a bug.

A pattern shows up in maybe a third of the agent demos I’m forwarded. The agent tries a tool. It fails. The agent tries a slightly different tool. That fails too. The agent tries the first tool again with different arguments. That also fails. The agent then “reflects” — emits a paragraph about how it should think more carefully — and tries another permutation of the same tool. At some point the wrapper times out and the demo cuts.

The team running the demo will tell you the agent is “still learning.” It is not. There is no learning happening. The agent is in a loop, the wrapper is hiding it, and the system has no notion of when to stop.

This is not an agent problem. This is a budget problem. The agent has no budget. It does not know what a budget is. The system was designed under the assumption that more thinking is always better, which is false.

Budgets are part of the spec

A well-built agent is scoped by three budgets, declared before any reasoning starts. Every step the agent takes is checked against all three.

Cost budget: total dollars (or model-token-equivalents) the agent is allowed to spend per task.
Latency budget: total wall-clock time the agent is allowed to take per task.
Depth budget: total number of tool calls (or planning iterations) the agent is allowed to take.

If any one of these exceeds its budget, the agent is required to halt and produce a structured partial result — not retry, not reflect, not “let it think longer.”

These three numbers are determined by the use case, not by the model. A real-time dispatch agent has a 4-second latency budget and a 20-cent cost budget. A nightly research assistant has a 2-hour latency budget and a $4 cost budget. The same model serves both, with different orchestration.

If you cannot tell me your three budgets, you do not have an agent. You have a chatbot that calls functions.

What “let it think longer” gets wrong

The temptation to extend the budget when an agent struggles is strong. Almost every team building agents this way reaches for it. The reasoning sounds rational: the model is almost there; one more step and it will figure it out. Let it think longer.

It will not figure it out. Across thousands of agent traces we’ve reviewed, the pattern is overwhelmingly: when an agent fails to make progress in the first three or four tool calls, additional calls do not improve the outcome. The shape of the trace deteriorates. The agent tries the same thing in different wrappings. It calls a tool, gets a 400, calls the tool again with the same arguments, gets the same 400, “reflects,” and concludes that the tool is broken.

Models do not autonomously generate better hypotheses by being granted more compute. They generate more hypotheses, drawn from the same distribution. That is sampling, not reasoning. If the gold path is not in that distribution, you get more wrong answers at higher cost.

The fix is not to think longer. The fix is to (a) escalate to a human, (b) hand off to a specialized sub-agent with different tools, or (c) emit a structured failure with a partial result. All three are legitimate. “Try again” is not.

Concrete shape of a budgeted agent

Here is the loop we build, roughly:

async def run(task, budgets):
    state = init_state(task)
    spent = {"cost": 0.0, "ms": 0, "calls": 0}
    deadline = time.time() + budgets["ms"] / 1000

    while not state.done:
        if (spent["cost"] >= budgets["cost"]
                or time.time() >= deadline
                or spent["calls"] >= budgets["calls"]):
            return emit_partial(state, spent, reason="budget")

        step = await plan_one_step(state, remaining=remaining(spent, budgets))
        result, cost, ms = await execute_step(step)
        spent["cost"] += cost
        spent["ms"] += ms
        spent["calls"] += 1
        state = update(state, step, result)

    return emit_final(state, spent)

Two things worth noticing.

First, remaining(spent, budgets) is passed into the planner. The planner sees how much budget is left. The agent’s plan is conditioned on the budget. A planner with 60% budget remaining picks a different next step than the same planner with 5% remaining. This isn’t a clever trick; it’s just acknowledging that the planner cannot do its job without knowing what resources it has.

Second, when the budget is exhausted, the agent does not retry. It emits a partial result: what it found, what it tried, why it stopped. The downstream consumer — a human or an orchestration layer — gets to decide what to do.

The partial-result contract

The single most underspecified part of nearly every agent system in production is what happens when things don’t go well.

Most agents in production emit one of two outputs: a happy-path JSON object on success, and an apology paragraph on failure. The apology paragraph is useless. Nobody downstream can parse it. The orchestration layer can’t escalate gracefully. The end user gets a wall of “I’m sorry, I was unable to…” text and decides the system is broken.

What we build instead is a structured failure schema that mirrors the success schema. Something like:

{
  "status": "partial",
  "reason": "budget_exhausted",
  "spent": { "cost": 0.18, "ms": 2840, "calls": 6 },
  "best_effort": { ... whatever the agent did manage to determine ... },
  "next_actions": [
    { "type": "human_review", "what": "Confirm the carrier code for shipment 0418-Q." },
    { "type": "retry_with", "tool": "tms.search", "args": {...} }
  ]
}

Three properties of this schema matter:

status is machine-readable. Downstream code branches on it.
best_effort is not optional. Even when the agent could not finish, it produces what it has. Half a route is a useful starting point for a dispatcher. Half a research summary is a useful starting point for a PM.
next_actions are concrete enough to execute. Not “consider escalating”; “call human_review with this argument.”

Agents that emit this contract integrate cleanly. Agents that emit apology paragraphs do not.

When a budget is too small

Sometimes the budget is the problem. The task genuinely cannot be done inside it. Two responses are correct:

Tighten the task scope. “Find all carriers serving SF→DEN this week” is harder than “Find any carrier serving SF→DEN this week”; the first is a sweep, the second is a lookup. Different budgets.
Switch to a different agent topology. A single agent on a 20-call budget is often weaker than two agents at 10 calls each, where the first does retrieval and the second does synthesis. The pipeline gets you depth at constant cost.

What is not correct: raise the budget and hope. If you raised the budget once, you’ll raise it again. Eventually the agent is allowed to spend an hour and twenty dollars on a single dispatch decision and your unit economics are gone.

What to look for in a trace

When reviewing an agent in production, run a quick check on the last 200 traces. Four signals:

p95 calls per task. If the median task takes 4 calls and the 95th percentile takes 18, you have a long tail of agents that are spiraling. Fix the spirals.
Repeat-tool ratio. Of the tool calls in each trace, how many call the same tool with the same arguments as a previous call in the same trace? Anything above 5% is a signal that the agent has no memory of what it just did.
Reflection ratio. What fraction of calls are “reflection” steps (no external tool call, model talks to itself)? Above 20% suggests the agent is rambling instead of acting.
Partial-result rate. Of the tasks that didn’t produce a happy-path result, how many produced a structured partial? If the answer is “none, they all produced apology paragraphs,” you do not have a robust agent.

None of these are sophisticated. All of them are skipped routinely.

Read this if you take one thing away

Budgets are part of the spec. They are not a fallback. They are not an emergency brake. They are the constraint under which the agent is designed.

“Let it think longer” sounds like generosity. In practice it is a way to defer the decision about what the agent is supposed to do when it can’t make progress. The decision is unavoidable. You either make it on purpose, at design time, or you make it accidentally, at runtime, by letting the agent burn money in a loop.

The agents we build don’t think longer. They think shorter, and when they hit a wall, they say so.

Notes on agent budgets: why "let it think longer" is a bug.

Budgets are part of the spec

What “let it think longer” gets wrong

Concrete shape of a budgeted agent

The partial-result contract

When a budget is too small

What to look for in a trace

Read this if you take one thing away

Indirect prompt injection, by the numbers.

Your learning-rate schedule silently overrides your data-curation decisions.

Approving an agent's action is not authorizing it.

Tell us about it.

Got it.