Multi-agent systems are usually one agent too many.
Splitting a task across coordinating agents adds context-handoff loss, compounding latency and cost, and a wider failure surface — overhead that usually exceeds the benefit. Start with one agent.

Splitting a task across coordinating agents adds context-handoff loss, compounding latency and cost, and a wider failure surface — overhead that usually exceeds the benefit. Start with one agent.

A team has a task: take a support ticket, pull the customer’s account and order history, check the relevant policy, and draft a resolution. They build it as a crew. A router agent classifies the ticket. A retrieval agent fetches the account. A second retrieval agent fetches the policy. A reasoning agent decides the resolution. A drafting agent writes the reply. Five agents, five system prompts, an orchestration layer to pass messages between them, and an architecture diagram with five boxes that looks, on a slide, like serious engineering.
It works in the demo. In production it is slower than the old version, costs four times as much per ticket, and fails in ways nobody can trace — the draft sometimes contradicts the policy the retrieval agent fetched, and no single agent’s log shows why. So a sixth agent gets added to “verify” the draft against the policy. The diagram grows another box.
The honest version of this task is one agent. One model, one context, a handful of tools — account lookup, policy lookup, draft — running a single plan-act-observe loop. It is faster, it costs a quarter as much, and when it fails the whole reasoning trace is in one place. The five-agent crew did not buy capability. It bought overhead, and dressed the overhead as architecture.
This post is about when to split a task across multiple coordinating agents and when not to — and the answer, most of the time, is not.
Multi-agent is not a mistake people make because they are careless. It is attractive for reasons that are genuinely good, and a fair argument has to start there.
Separation of concerns. Decomposing a problem into specialists is how we build every other kind of software. A router, a retriever, a planner, a writer — each with one job — reads like clean design, because in ordinary systems it is clean design. The instinct is sound; it is the transfer of the instinct to agents that needs scrutiny.
Specialization. A retrieval agent can have a tight system prompt about retrieval and nothing else. A drafting agent can be tuned for tone. Each agent’s prompt is shorter, more focused, easier to reason about in isolation than one prompt that has to do everything.
It demos well. Five labelled boxes passing messages is legible to a non-technical audience in a way a single loop is not. A multi-agent diagram looks like more was built. For a team selling a roadmap, that legibility has real value, and pretending it does not is dishonest.
Some tasks genuinely are parallel. This is the strongest argument and we will return to it. When a task decomposes into independent subtasks that do not need each other’s intermediate results, running them concurrently is not overhead — it is the right shape, and a single agent doing them in sequence is the slow version.
Hold all four. The case for multi-agent is not a strawman. The case against defaulting to it is that the first three benefits are mostly cosmetic for agents specifically, and the fourth is a real but narrow condition that most tasks do not meet — and that splitting anyway imposes four concrete costs that the diagram does not show.
This is the cost that does the most damage and shows up the least on a diagram. When agent A finishes and hands to agent B, what crosses the boundary is a message — a summary, a result, a structured object. What does not cross is everything agent A saw and decided along the way: the dead ends it ruled out, the ambiguous bit of the ticket it interpreted one way rather than another, the implicit assumptions baked into how it framed its output.
Cognition’s engineering team made exactly this argument in their June 2025 essay “Don’t Build Multi-Agents”, and their illustration is worth keeping. Asked to build a Flappy Bird clone, a system splits the work: one subagent builds the background, another builds the bird. The background subagent renders something in a Super Mario style; the bird subagent renders a bird in a visually different style. Neither was wrong given what it could see. Each made a reasonable decision in isolation, and the decisions conflicted, because no agent held the whole picture. The parent agent is left assembling two incompatible halves.
Cognition draws two principles from this, and they are the load-bearing part of the argument. First: share context, and share full agent traces, not just individual messages — a subagent needs to see how a decision was reached, not only its result. Second: actions carry implicit decisions, and conflicting decisions produce bad results. Every output an agent produces silently commits to choices the next agent cannot see and may contradict.
A single agent has this property for free. One context window holds the whole run; every later decision is conditioned on every earlier one, because they are literally in the same stream. The moment you split, you are choosing to throw that away — and the more you try to recover it by piping fuller and fuller traces between agents, the more the agents start to look like one agent with extra serialization steps.
Two agents in sequence do not take the time of one. They take the time of one, plus the second one, plus the handoff. The agent-latency-math essay is about the gap between demo latency and production latency for a single agent; multi-agent stacks that gap on itself. Each agent in a chain has its own model calls, its own tool round-trips, its own cold-start risk. A five-agent pipeline is five latency budgets in a trench coat.
Cost compounds the same way, and there is a number for it. Anthropic, in its June 2025 writeup “How we built our multi-agent research system” — a post we will lean on shortly as the honest counterargument — reports that agents already use roughly 4× the tokens of a chat interaction, and that their multi-agent system used about 15× the tokens of a chat. That multiple is not waste; it buys real capability for the task they built it for. But it is the number to put against any multi-agent proposal: the architecture does not cost a bit more, it costs multiples more, and every agent re-reads context, re-establishes its task, and re-emits a result that the next agent re-ingests.
The agent-budgets essay argues that an agent is scoped by a cost budget, a latency budget, and a depth budget declared before any reasoning starts. A multi-agent system has to satisfy those budgets across the whole crew — and a crew whose combined cost is 15× a single chat will blow a cost budget that one well-built agent would have lived inside. Splitting a task does not just risk the budget; it usually is the reason the budget is missed.
Here is the same five-agent pipeline drawn as a reliability problem rather than an architecture:
┌───────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐
│ route │──►│ retrieve │──►│ retrieve │──►│ reason │──►│ draft │
│ agent │ │ account │ │ policy │ │ agent │ │ agent │
└───────┘ └──────────┘ └──────────┘ └──────────┘ └─────────┘
│ │ │ │ │
each box: a model call that can mis-decide, a prompt that can be
injected, a handoff that can drop or distort context, a tool that
can fail. five boxes ► five independent ways the task is wrong.A single agent fails when its one reasoning loop goes wrong. A five-agent pipeline fails when any of five reasoning loops goes wrong, or when any of the four handoffs between them loses or distorts something. Reliability across a chain is multiplicative: five stages each succeeding 95% of the time is a pipeline that succeeds about 77% of the time, before counting the handoffs. You did not make the system more robust by giving each agent a smaller job. You gave a stochastic system more independent opportunities to be wrong, and wired them in series so any one of them sinks the result.
It widens the security surface too. Every agent that reads external content is a prompt-injection entry point; every tool an agent holds is attack surface. Five agents is five context windows an attacker can aim at, and a poisoned result from one agent flows downstream as trusted input to the next — the tool-output injection channel, except the “tool” is another one of your own agents.
The five-agent ticket system drafts a reply that contradicts the fetched policy. Where is the bug? It could be the policy-retrieval agent fetching the wrong document. It could be the reasoning agent misreading a correct document. It could be the handoff between them summarizing the policy lossily. It could be the drafting agent ignoring a correct decision. Four candidates, four separate logs, four system prompts, and the interaction between them is where the bug most often actually lives.
The agent-observability essay makes the case that an agent needs step-level tracing — every step’s intent, tool call, result, and result status, threaded by a trace ID. That is the floor for one agent. A multi-agent system needs all of that plus a coherent trace that spans agents — and most crews are assembled without it, so each agent logs its own steps and nothing reconstructs the cross-agent story. Anthropic is blunt about the underlying difficulty: agents “make dynamic decisions and are non-deterministic between runs, even with identical prompts,” which “makes debugging harder.” Multiply that non-determinism across five coordinating agents and a handoff layer, and a bug seen once may never reproduce — the failure mode that deterministic replay exists to address, and which a multi-agent topology makes materially worse.
The argument so far is against defaulting to multi-agent. It is not against multi-agent. There is a real case, and Anthropic’s research-system writeup is the honest articulation of it: their multi-agent setup, with Claude Opus 4 as a lead agent and Claude Sonnet 4 subagents, outperformed a single-agent Claude Opus 4 by 90.2% on their internal research eval. That is not a rounding error. For the task they built it for, multi-agent won decisively.
Read why it won, though, because the conditions are specific. Anthropic names the tasks multi-agent suits: ones that “involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.” Open-ended research is exactly that — many independent search threads, far more source material than one context holds. And Anthropic is equally explicit about the other side: domains “that require all agents to share the same context or involve many dependencies between agents are not a good fit,” and they note that “most coding tasks involve fewer truly parallelizable tasks than research.” Two frontier labs, apparently opposed, actually agree on the principle and disagree only on how often a given task meets it.
So multi-agent earns its place under three conditions — and they are conjunctive enough that most tasks fail at least one:
Information that genuinely exceeds a single context window is the one case that can force the issue even without parallelism — but reach for compaction and retrieval first. Managing what lives in one agent’s window is context engineering, and it solves more “we need more agents” problems than teams expect; splitting the task is the heavier instrument, to be used only when the working window genuinely cannot be made to hold the job.
The default is one agent. Build the single-agent version first — one model, one context, the tools the task needs, a budgeted plan-act-observe loop. Ship it. Measure it. Only when it demonstrably fails — and fails for a structural reason that maps to one of the three conditions above, not a tuning reason you have not yet exhausted — do you split.
This is the same discipline the agent-budgets essay applies to “let it think longer.” There, the wrong move is granting an agent more compute when it stalls; here, the wrong move is granting an architecture more agents when it stalls. Both are ways of adding resource to dodge a design question. And both have the same tell: if you reach for the second agent once, you will reach for the sixth, and end up with a crew nobody can trace, justified by a diagram nobody benchmarked.
A second agent should feel like a cost you are reluctantly accepting for a specific, named, measured reason — not like progress.
Before you split a task across agents, answer these. They are ordered; a “no” or “not sure” at any step means do not split yet.
| # | Question | Split only if |
|---|---|---|
| 1 | Have you built the single-agent version and measured where it actually fails? | yes |
| 2 | Does that failure trace to a structural cause — not a prompt, tool, or budget you have not yet tuned? | yes |
| 3 | Do the subtasks run concurrently without needing each other’s intermediate results? | yes … |
| 4 | … or is there a hard isolation / permission boundary that roles must not cross? | … or yes … |
| 5 | … or do the roles genuinely need different models or tools, not the same model with a new prompt? | … or yes |
| 6 | Can your tracing reconstruct one coherent story across agent boundaries, not just per agent? | yes |
| 7 | Does the multi-agent version fit the combined cost, latency, and depth budgets — at roughly 15× the token cost of a single chat? | yes |
Pass 1, 2, 6, and 7, and at least one of 3–5, and you have a real multi-agent case — build it deliberately, instrument it, and own the cost. Fail any of them and the answer is one well-built agent. Most tasks fail at least one. That is the finding.
Two agents are not twice the system. They are one system with a seam — and a seam is where context leaks, cost compounds, and bugs hide. Earn the seam, or do not cut it.

A one-word change to a system prompt can move accuracy by dozens of points, and a provider's model update can regress your app overnight. A prompt or model swap is a deploy. Give it a staged rollout and a one-action rollback path.
11 min →
The monthly inference bill arrives as one number, and nobody can say which agent, which customer, or which tool spent it. Agent cost is too variable to estimate and has to be attributed after the fact — per run, per tool, per tenant. The layer most stacks skip.
11 min →
An agent that asks permission for everything trains its reviewers to rubber-stamp, and the one dangerous action slips through in the noise. Approval gates belong on consequence and on uncertainty — not on every step. Where to put them.
12 min →