Cost observability for an agent fleet.

The monthly inference bill is one number. This month it is a third larger than last month’s, and that is the entire resolution of the data: a single figure, larger than before. Finance asks the question that figure is supposed to support — which product, which customer, which feature drove the increase — and engineering cannot answer it. Not because the answer is buried in a slow query or a cold data warehouse. Because it was never written down. The agent fleet made several million model calls last month, and not one of them carried a tag recording what it was for, who it was for, or which step of which run it belonged to. The bill is the sum of several million unlabelled events, and a sum cannot be un-summed.

A bill is not observability. It is a total, and a total is the one view of cost that supports no decision. Stare at “up 33%” for an hour and you cannot tell whether it is a provider raising prices, one enterprise customer who quadrupled usage, a feature that shipped on the ninth, or a single agent caught in a retry loop since the twelfth. Each of those has a different fix — renegotiate, reprice the customer, roll back the feature, patch the loop — and the total is equidistant from all of them. This essay is about the layer that turns the total into answers: per-run, per-tool, per-tenant cost attribution. Why agent economics make that layer mandatory rather than a nice-to-have, why the telemetry schema for it already exists, and why most agent stacks ship without ever emitting it.

Agent cost cannot be estimated, only attributed

The reason this layer matters more for an agent fleet than for a plain API service is that agent cost is genuinely, structurally unpredictable — and that is a measured finding, not an impression. A 2026 study of token consumption in agentic coding tasks (ICLR 2026, OpenReview) measured three things that should reshape how anyone thinks about an agent bill.

The first: some runs of the same task consumed up to 10× more tokens than others. Not different tasks — the same task, run again, costing an order of magnitude more. An agent’s path through a problem is data-dependent and non-deterministic; one run finds the answer in four tool calls, another wanders through fifteen, re-reads a file, backtracks, and re-plans. The second finding follows from the first: total token consumption was very hard to predict before execution, with a correlation below 0.15 against the obvious predictors. Whatever you would naturally measure up front — prompt length, task category, input size — barely moves with the eventual cost. The third cuts against a common intuition: input tokens dominated total cost even with prompt caching in play. Chat workloads are output-heavy; agent workloads are input-heavy, because every step re-sends an accumulating context of history, tool results, and instructions.

Put those three together and pre-estimation is a dead end. You cannot budget a workload whose cost on identical inputs swings by 10× and does not correlate with anything visible in advance. The estimate would be a guess wearing a number’s clothing. What you can do, and the only thing you can do, is record what each run actually cost, attributed to the run, the tool, and the tenant: observability after the fact, because the study measured forecasting before the fact and found that it does not work for this class of workload. That is the complement to the pre-flight budgets the agent-budgets essay argues for, and the two are not in tension. A budget is a cap declared before a run starts: a ceiling on exposure, a circuit breaker. Attribution is the ledger read after the run ends: where the money actually went. A budget without attribution stops a runaway but never tells you why it ran away; attribution without a budget explains the overspend in precise detail after you have already paid it. A fleet needs both halves: the cap going in and the ledger coming out.
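The two halves can live in one small object. What follows is a minimal sketch, not a reference implementation: `RunMeter`, `BudgetExceeded`, the step names, and the token figures are all hypothetical, chosen only to show a cap checked on the way in and a ledger that survives an aborted run.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a run spends past its pre-declared token cap."""


@dataclass
class RunMeter:
    run_id: str
    tenant: str
    token_cap: int                               # the budget: declared before the run
    ledger: list = field(default_factory=list)   # the attribution: read after the run

    @property
    def spent(self) -> int:
        return sum(tokens for _, tokens in self.ledger)

    def record(self, step: str, tokens: int) -> None:
        # Write the ledger entry first, so an aborted run still explains itself.
        self.ledger.append((step, tokens))
        if self.spent > self.token_cap:
            raise BudgetExceeded(
                f"{self.run_id} spent {self.spent} tokens, cap is {self.token_cap}"
            )


meter = RunMeter(run_id="run-001", tenant="tenant-42", token_cap=10_000)
try:
    for step, tokens in [("plan", 3_000), ("web_search", 4_500), ("retry", 4_000)]:
        meter.record(step, tokens)
except BudgetExceeded:
    pass  # the cap stopped the run; the ledger says which step crossed it
```

The cap stops the run at the third step, and the ledger still names the step that crossed it, which is the part a bare circuit breaker would have thrown away.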

Cost scales with the architecture you chose

Before getting to attribution, one fact about raw magnitude, because it determines how much there is to attribute. Anthropic’s account of building a multi-agent research system reports that an agent uses roughly 4× the tokens of a chat interaction, and a multi-agent system roughly 15×. A coordinating agent that fans work out to sub-agents, each with its own context window and its own model calls, is not 15% more expensive than a chatbot — it is an order of magnitude more, by construction.

The same write-up found something that should change how the whole subject is framed: token usage alone explained about 80% of the variance in task performance. Sit with that. Token spend is not merely the cost axis of an agent system; it is most of the performance axis as well. The thing you are paying for and the thing that determines whether the system works are very nearly the same quantity. That collapses the usual mental separation between a finance concern and an engineering concern. Cost observability is not a billing chore bolted onto the side of the system — it is a direct measurement of the variable that most predicts whether the system does its job.

And the 4×-to-15× spread carries the sharpest operational point. The single largest cost decision in an agent system is architectural: choosing a multi-agent design over a single agent is choosing, up front, a cost structure that is multiples more expensive. That choice is sometimes correct — some problems genuinely need the parallelism. But a fleet that cannot attribute cost cannot see which of its architectures landed in the expensive bracket, cannot tell whether a given multi-agent workflow earned its 15×, and so cannot make the trade-off deliberately. It just pays the number. The architectural choice and the cost consequence have to be visible together, or the choice is being made blind.

The blindness is rarely the result of a bad decision; it is the result of a quiet one. A workflow ships as a single agent. Later, one task in it is flaky, so someone splits that task out to a dedicated sub-agent: a local fix, well reasoned, that touches one workflow. A month on, three more sub-agents have arrived the same way, each a sensible local fix, and the workflow is now a five-agent fan-out costing roughly 15× a chat where it once cost 4×. No one decided to nearly quadruple the cost structure; the structure drifted there one defensible commit at a time, and the only artifact of the drift is a bill that went up. With per-run attribution the drift is legible the week it starts: the cost-per-run of that workflow steps up visibly with each added agent, and the team gets to ask “is this fan-out worth it” while the answer can still change the design. Without attribution the same drift is just the total creeping, indistinguishable from growth, and the question never gets asked.
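Once each run is a costed object, spotting that kind of step is a small calculation. The sketch below is only illustrative: the weekly figures are invented, the week labels are placeholders, and the 1.5× threshold is an arbitrary assumption rather than a recommended value.

```python
from statistics import median

# Hypothetical per-run costs (USD) for one workflow, grouped by week.
weekly_run_costs = {
    "week-01": [0.040, 0.050, 0.045],
    "week-02": [0.044, 0.046, 0.050],
    "week-03": [0.080, 0.090, 0.085],   # a sub-agent was added this week
    "week-04": [0.090, 0.084, 0.088],
}


def step_ups(weekly, threshold=1.5):
    """Return (week, ratio) pairs where median cost per run jumped week over week."""
    weeks = sorted(weekly)
    medians = [median(weekly[week]) for week in weeks]
    flagged = []
    for previous, current, week in zip(medians, medians[1:], weeks[1:]):
        ratio = current / previous
        if ratio >= threshold:
            flagged.append((week, round(ratio, 2)))
    return flagged


drift = step_ups(weekly_run_costs)
```

The median is used instead of the mean so that a single runaway run does not masquerade as an architectural change.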

Attribution is the actual hard problem

It is worth being precise about where the difficulty actually sits, because it is not in the arithmetic. The FinOps Foundation, framing cost management for AI, names it directly: the critical challenge is allocation — “identifying the consumer of the model output” — and there is, in its words, a “lack of accepted frameworks for cost allocation across multi-agent workloads.” That is the gap stated plainly by the body whose entire job is cost discipline.

The structural reason is shared infrastructure. A single model endpoint is called by many agents, on behalf of many tenants, across many runs, and the invoice it generates arrives flattened — a count of tokens and a price, with every dimension that would make it actionable already averaged away. The endpoint sees tokens. It does not see that these nine thousand belonged to a research agent’s synthesis step for tenant 42, and those two thousand to a different tenant’s failed retry. That structure is gone by the time the bill is computed, and it cannot be reconstructed from the total afterward, because the information was never in the total to begin with.

So attribution cannot be a reporting step. It has to be added on the way in, as telemetry emitted at the moment each call is made, while the context that gives the call meaning is still in hand — which run, which step, which tool, which tenant. The unit economics are trivial once that data exists: cost per token and cost per inference are simple multiplications. But a unit cost is only useful sliced along the dimension a decision needs — per run to find the expensive executions, per tool to find the expensive capabilities, per tenant to find the expensive customers. None of those slices exists unless the system emitted the dimension to slice on, at call time, on every call. Miss it there and no amount of later querying recovers it.
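To make the point concrete, here is a minimal sketch of the multiplication and the slicing. The prices, the record layout, and the helper names (`call_cost`, `slice_cost`) are assumptions for illustration; only the idea that each call carries its run, tool, and tenant comes from the argument above.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

# Each record is one model call, tagged at call time with the dimensions to slice on.
calls = [
    {"run": "r1", "tool": "web_search", "tenant": "tenant-42", "input": 9_000, "output": 400},
    {"run": "r1", "tool": "doc_retrieve", "tenant": "tenant-42", "input": 2_000, "output": 100},
    {"run": "r2", "tool": "web_search", "tenant": "tenant-7", "input": 2_000, "output": 50},
]


def call_cost(call):
    """Unit economics: token count times price, summed over token kinds."""
    return sum(call[kind] * PRICE_PER_MTOK[kind] / 1_000_000 for kind in PRICE_PER_MTOK)


def slice_cost(records, dimension):
    """Total cost grouped by one tag; only possible because the tag was emitted."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += call_cost(record)
    return dict(totals)


by_run = slice_cost(calls, "run")
by_tool = slice_cost(calls, "tool")
by_tenant = slice_cost(calls, "tenant")
```

Delete the `tenant` key from the records and `slice_cost(calls, "tenant")` has nothing to group on, which is the whole argument in one line.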

The schema already exists — most stacks just do not emit it

Here is the part that should be encouraging: you do not have to invent the telemetry format, or design a tagging convention, or argue one into existence team by team. OpenTelemetry’s semantic conventions for generative AI already define it, vendor-neutrally.

The conventions specify an invoke_agent span carrying gen_ai.agent.id, gen_ai.agent.name, and gen_ai.agent.version; an execute_tool span carrying gen_ai.tool.name; a gen_ai.conversation.id to correlate a session or a tenant across many spans; and token-usage attributes recorded with real granularity — gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and, kept deliberately separate, gen_ai.usage.cache_read.input_tokens and gen_ai.usage.cache_creation.input_tokens. The metrics spec adds one rule that matters directly for billing accuracy: when a system can report both used and billable token counts, instrumentation must report the billable one. That single rule is the difference between a dashboard that roughly tracks the invoice and one that reconciles to it.
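A sketch of how that preference might look at the point of emission. The attribute names on the right are the ones listed above; the input keys (`billable_input_tokens`, `cache_read_tokens`, and so on) are hypothetical stand-ins for whatever a provider's usage payload actually contains, and applying the billable-first rule to span attributes as well as metrics is an assumption of this sketch.

```python
def usage_attributes(usage: dict) -> dict:
    """Map a provider usage payload (hypothetical key names) to gen_ai.usage.* attributes."""

    def billable_or_used(kind: str) -> int:
        # Prefer the billable count whenever the provider reports one.
        billable = usage.get(f"billable_{kind}_tokens")
        return billable if billable is not None else usage[f"{kind}_tokens"]

    attrs = {
        "gen_ai.usage.input_tokens": billable_or_used("input"),
        "gen_ai.usage.output_tokens": billable_or_used("output"),
    }
    # Cache counts are kept as separate attributes, never folded into the totals here.
    cache_keys = {
        "cache_read_tokens": "gen_ai.usage.cache_read.input_tokens",
        "cache_creation_tokens": "gen_ai.usage.cache_creation.input_tokens",
    }
    for source_key, attribute in cache_keys.items():
        if source_key in usage:
            attrs[attribute] = usage[source_key]
    return attrs


attrs = usage_attributes(
    {
        "input_tokens": 9_100,
        "billable_input_tokens": 9_000,
        "output_tokens": 400,
        "cache_read_tokens": 6_200,
    }
)
```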

The shape of a properly instrumented run is a span tree, and the tree is the attribution — every node already carrying the identifiers the slices need:

  invoke_agent  agent.id=research-07  conversation.id=tenant-42  ── 18,400 tok
   ├─ chat            model call                                ──  9,100 tok
   ├─ execute_tool    tool.name=web_search                      ──  2,300 tok
   ├─ execute_tool    tool.name=doc_retrieve                    ──  1,050 tok
   │   └─ cache_read.input_tokens                               ──  6,200 tok  (billed at cache rate)
   └─ chat            model call (synthesis)                    ──  5,950 tok
  roll up by agent.id → per-agent cost   by tool.name → per-tool   by conversation.id → per-tenant

Read the tree and the three questions answer themselves. Sum the nodes under one conversation.id and you have that tenant’s cost; group by tool.name across all trees and you have per-tool spend; the per-run total is the root. Nothing needs reconstructing because nothing was ever flattened — the structure that the bill destroys is preserved, span by span, at the moment of each call.
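The roll-up can be sketched in a few lines over the tree above. This is an illustration with plain dataclasses, not the OpenTelemetry SDK: the `Span` class, `walk`, and `roll_up` are hypothetical helpers, and the cache-read line is left out on the assumption that it is a usage attribute rather than a separately summed span.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    tokens: int = 0                              # tokens on this span itself, excluding children
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def total(self) -> int:
        return self.tokens + sum(child.total() for child in self.children)


def walk(span, inherited=None):
    """Yield (span, attrs), with identifiers inherited from ancestor spans."""
    attrs = {**(inherited or {}), **span.attrs}
    yield span, attrs
    for child in span.children:
        yield from walk(child, attrs)


def roll_up(roots, key):
    """Sum each span's own tokens under the value of one attribute."""
    totals = defaultdict(int)
    for root in roots:
        for span, attrs in walk(root):
            if key in attrs:
                totals[attrs[key]] += span.tokens
    return dict(totals)


run = Span(
    "invoke_agent",
    attrs={"gen_ai.agent.id": "research-07", "gen_ai.conversation.id": "tenant-42"},
    children=[
        Span("chat", 9_100),
        Span("execute_tool", 2_300, {"gen_ai.tool.name": "web_search"}),
        Span("execute_tool", 1_050, {"gen_ai.tool.name": "doc_retrieve"}),
        Span("chat", 5_950),
    ],
)

per_tenant = roll_up([run], "gen_ai.conversation.id")
per_tool = roll_up([run], "gen_ai.tool.name")
```

Children inherit the root's identifiers during the walk, so a model call that never mentions the tenant is still counted toward it.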

One honest caveat: these GenAI conventions are still at “Development” stability, which means attribute names can still change between spec versions. That instability is part of why stacks skip the work, and it is a fair thing to acknowledge. But it is not a reason to skip the instrumentation — it is a reason to pin the convention version you build against and own the upgrade deliberately when it moves. A schema that might be renamed is a vastly better foundation than a bill that cannot be sliced at all. The schema exists. It is vendor-neutral. The only reason a fleet’s bill is still one number is that nobody wired the spans in.
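One way to own the upgrade is to keep every attribute name in a single place, next to the version it was copied from. This is a sketch under assumptions: the version string is a placeholder to fill in, `semconv.version` is a hypothetical bookkeeping attribute and not part of the conventions, and `span_attributes` is an illustrative helper.

```python
# Placeholder: fill in the spec release you actually build against.
PINNED_SEMCONV_VERSION = "<pinned-spec-version>"

# The only place attribute names are spelled out; a spec rename is a one-file change.
ATTR = {
    "agent_id": "gen_ai.agent.id",
    "tool_name": "gen_ai.tool.name",
    "conversation_id": "gen_ai.conversation.id",
    "input_tokens": "gen_ai.usage.input_tokens",
    "output_tokens": "gen_ai.usage.output_tokens",
}


def span_attributes(**values):
    """Translate internal field names to the pinned attribute names."""
    unknown = set(values) - set(ATTR)
    if unknown:
        raise KeyError(f"no pinned attribute name for: {sorted(unknown)}")
    attrs = {ATTR[name]: value for name, value in values.items()}
    # Hypothetical bookkeeping attribute, so stored spans record which names they used.
    attrs["semconv.version"] = PINNED_SEMCONV_VERSION
    return attrs


example = span_attributes(agent_id="research-07", tool_name="web_search")
```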

What to attribute, and what each slice tells you

With the span tree flowing, three slices each retire a standing question that a total can never touch.

Per run. Which executions cost 10× the median — the loops, the runaway retries, the pathological inputs that send an agent wandering. The run-to-run variance the coding-agents study measured is completely invisible in an average; it lives entirely in the long tail, and the long tail is only visible when each run is a costed object you can sort. The expensive run is not a rounding error spread thinly across the fleet. It is a small number of executions carrying a large share of the bill, and finding them is a sort, not an investigation.
Per tool. Which tool calls dominate spend, and — because input tokens dominate — where context is being re-sent. A tool that looks cheap per call can be the most expensive line in the fleet if every invocation drags a large context along with it. That is the context-engineering discipline the context-engineering essay describes, finally turned into a number you can watch instead of a principle you hope is being followed.
Per tenant. Which customer the spend actually served. This is the slice that makes per-customer margin a real figure rather than a fleet-wide average, the slice that tells you a flat-priced plan is being consumed at ten times the rate it was priced for — and the exact slice finance asked for in the opening. It does not exist without a conversation.id on every span tying the run to a tenant.
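The per-run slice in particular reduces to a sort. A minimal sketch, with invented run costs and a hypothetical `expensive_runs` helper; the 10× multiple mirrors the variance figure quoted earlier and is not a tuned threshold.

```python
from statistics import median

# Hypothetical per-run costs in USD, as produced by rolling up each run's spans.
run_costs = {"r1": 0.04, "r2": 0.05, "r3": 0.03, "r4": 0.62, "r5": 0.04}


def expensive_runs(costs, multiple=10.0):
    """Runs costing at least `multiple` times the median, most expensive first."""
    typical = median(costs.values())
    flagged = [(run, cost) for run, cost in costs.items() if cost >= multiple * typical]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


outliers = expensive_runs(run_costs)
outlier_share = sum(cost for _, cost in outliers) / sum(run_costs.values())
```

In this made-up sample one run out of five carries most of the spend, which is the long-tail shape the average hides.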

Treat cache hits as a first-class attributed dimension, not an afterthought folded into a total. cache_read and cache_creation tokens bill at different rates, and a fleet that cannot see its cache-hit rate per agent is blind to one of its largest and most controllable levers — a prompt restructured to be cache-friendly can move the bill materially, but only if the cache dimension is visible enough to show the improvement landed.
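A sketch of what that visibility might compute per agent. The rates are hypothetical, the agent names are invented, and the three token counts are assumed to be disjoint (uncached input, cache reads, cache writes); providers differ on whether their reported input total already includes cached tokens, so check before summing.

```python
# Hypothetical rates in USD per million input tokens; substitute real ones.
RATES = {"uncached": 3.00, "cache_read": 0.30, "cache_creation": 3.75}

# Assumed disjoint input-token counts per agent.
usage_by_agent = {
    "research-07": {"uncached": 4_000, "cache_read": 6_000, "cache_creation": 0},
    "triage-02": {"uncached": 9_000, "cache_read": 0, "cache_creation": 1_000},
}


def cache_report(usage):
    """Cache-hit rate and input cost for one agent's token counts."""
    total = sum(usage.values())
    cost = sum(usage[kind] * RATES[kind] / 1_000_000 for kind in RATES)
    hit_rate = usage["cache_read"] / total if total else 0.0
    return {"hit_rate": hit_rate, "input_cost_usd": cost}


report = {agent: cache_report(usage) for agent, usage in usage_by_agent.items()}
```

Both agents here send the same 10,000 input tokens, yet under these assumed rates the one with a 60% hit rate pays less than half as much for them.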

The checklist

The instrumentation below is not a large project, but it has to be done at the point of emission — every item is something the code must record while a call is happening, because nothing on this list can be recovered from the bill after the fact. Treat it as the definition of done for a fleet’s telemetry, and verify each line against real exported spans rather than against the intention to emit them.

Every agent run is wrapped in an invoke_agent span carrying gen_ai.agent.id, gen_ai.agent.name, and gen_ai.agent.version, with each model call recorded as a child span beneath it.
Every tool call emits an execute_tool span with gen_ai.tool.name.
A gen_ai.conversation.id on every span ties each run to a tenant or session.
Token usage is recorded as input, output, and cache-read / cache-creation separately, and billable tokens are reported when they differ from used tokens.
Cost rolls up cleanly per run, per tool, and per tenant — each is a standing dashboard, not a query someone writes once and loses.
The OpenTelemetry GenAI convention version is pinned, because the spec is still at Development stability and moving.
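Verifying against real exported spans can start as a small check like the one below. It is a sketch under assumptions: spans are taken to be plain dictionaries with an `operation` and an `attributes` field, which is a hypothetical export shape, and the required sets simply restate the checklist rather than the full specification.

```python
# Attributes the checklist expects on each kind of span.
REQUIRED = {
    "invoke_agent": {
        "gen_ai.agent.id",
        "gen_ai.agent.name",
        "gen_ai.agent.version",
        "gen_ai.conversation.id",
    },
    "execute_tool": {"gen_ai.tool.name"},
    "chat": {"gen_ai.usage.input_tokens", "gen_ai.usage.output_tokens"},
}


def missing_attributes(spans):
    """Return (operation, missing attribute names) for every span that falls short."""
    problems = []
    for span in spans:
        required = REQUIRED.get(span["operation"], set())
        missing = sorted(required - set(span["attributes"]))
        if missing:
            problems.append((span["operation"], missing))
    return problems


exported = [
    {"operation": "invoke_agent", "attributes": {
        "gen_ai.agent.id": "research-07",
        "gen_ai.agent.name": "research",
        "gen_ai.agent.version": "1",
        "gen_ai.conversation.id": "tenant-42",
    }},
    {"operation": "execute_tool", "attributes": {}},
    {"operation": "chat", "attributes": {"gen_ai.usage.input_tokens": 9_100}},
]

problems = missing_attributes(exported)
```

Run against a sample of production spans, an empty result is the evidence the checklist asks for; a non-empty one names the exact attribute that was never emitted.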

Reading list

How Do Coding Agents Spend Your Money? — the 10× per-run variance, sub-0.15 predictability, and input-token dominance: OpenReview
Anthropic — how we built our multi-agent research system; agents at ~4× and multi-agent at ~15× chat token use: anthropic.com
OpenTelemetry — semantic conventions for generative AI systems, the vendor-neutral telemetry schema: opentelemetry.io
OpenTelemetry — GenAI agent and framework spans; invoke_agent, gen_ai.agent.id: opentelemetry.io
OpenTelemetry — GenAI metrics; the report-billable-tokens rule: opentelemetry.io
FinOps Foundation — FinOps for AI; allocation as the critical, unsolved challenge for multi-agent workloads: finops.org

The total at the bottom of the invoice is the one number a fleet already has and the one number it cannot act on. Every dimension that would make it actionable has to be attached span by span, while the call is happening — because the bill will never hand back what the spans did not record.
