# MCP in production: the four gaps nobody demos.

The demo always works. An agent connects to a Model Context Protocol server, lists its tools, calls one, gets a clean result, and the room nods. The connection is one process talking to one server over one session — no load balancer, no autoscaling, nobody else connected.

Then it goes to production: forty replicas behind an ALB, three hundred concurrent agents, an autoscaler that recycles cold replicas, a security team asking which servers each agent may touch. The protocol that worked perfectly in the demo starts dropping sessions, and the team blames the protocol.

The protocol is fine. "MCP works in a demo" and "MCP works in production" are different claims, and the gap is not a bug in MCP — it is a set of operational properties MCP deliberately leaves alone, because they are not the protocol's job. Four of them bite hard at scale. This is what they are and what you build around them.

## MCP won — and that is exactly the problem

The good news first, because it is real. MCP won. It is the de facto tool-integration standard, supported across every major AI lab — you do not pick a tool-calling format anymore, you implement MCP. The ecosystem reflects it: the public MCP server registry has grown from roughly a thousand servers to many thousands in about a year, a signal that MCP-backed agents are now common in production, and the Model Context Protocol project publishes a public 2026 roadmap on its blog.

But winning is itself the problem. A standard that wins gets deployed by people who were not in the room when it was designed: the demo crowd knew an MCP session is stateful and sticky, but the platform team inheriting the rollout six months later does not, because nobody wrote it on the box. Every gap below is a place where standard infrastructure assumes something MCP does not provide — and the mismatch is silent until it is an incident.

## Gap 1 — stateful sessions versus stateless load balancers

This gap causes the most production pages. An MCP session is stateful: when an agent connects, it establishes a session holding context — negotiated protocol version, initialized capabilities, subscriptions, and, for many servers, in-memory state tied to the work in progress. The session is not a token you can replay anywhere; it is a live thing pinned to one process, and requests in it are meaningful _only_ to the instance that holds it.

Standard infrastructure assumes the opposite. A round-robin load balancer sends each request to the next backend; an autoscaler adds and removes replicas on CPU. Both assume request N+1 can go to a different replica than request N with no consequence — that assumption is the entire reason horizontal scaling is cheap.

```
  Agent                Round-robin LB        MCP server replicas
   |-- initialize ---------->|---------------> replica A   session created here
   |<-- session id ----------|<---------------            (state lives here)
   |                         |
   |-- tools/call ---------->|---------------> replica B   never saw this agent
   |<-- ERROR ---------------|<---------------            "unknown session"
```

Put a stateful MCP session behind it and the failure is immediate. The session is created on replica A; the next call lands on replica B, which has never heard of it; the agent gets an error mid-task. It looks intermittent — it depends on the LB's rotation — which makes it hard to debug.

The fix is session affinity — sticky routing — made explicit, because it does not come for free. The MCP session identifier becomes a first-class routing key: a consistent hash on the session ID, cookie-based stickiness at the LB, or a session-aware proxy mapping session IDs to backends. It is mandatory before MCP touches more than one replica. And affinity trades one problem for another: load is now spread by _session lifetime_, so a few long-lived sessions hot-spot one replica while others idle — which leads to the next gap.

## Gap 2 — scaling a thing that holds connections open

A stateless HTTP service scales on requests per second divided by per-request cost. MCP does not fit that model: a server does not serve independent requests, it holds sessions, and sessions hold connections open across many calls and long idle gaps. That changes the math three ways.

**Connection limits become the ceiling, not CPU.** Every open session consumes a connection slot and a slice of memory whether or not it is doing anything this second — an agent that calls a tool, thinks for thirty seconds, then calls another holds its slot the whole time. Size the fleet on peak _concurrent sessions_, not requests per second; the two diverge sharply once agents have thinking time between calls, which, as the [agent-budgets essay](/blog/agent-budgets/) lays out, every well-built agent loop does.

**Server fan-out multiplies the count.** A non-trivial agent does not talk to one MCP server — it talks to a filesystem server, a database server, a search server, a payments server, each a separate session. One task is five sessions, five things that can be hot, cold, rate-limited, or down independently; three hundred agents against five servers is fifteen hundred live sessions.

**Cold starts land mid-task.** Add a replica to a stateless service and the cold start amortizes across thousands of future requests; nobody notices. An MCP replica starts with zero sessions, so the next agent pays the full cold-start latency — process boot, upstream connections, auth handshakes — on the critical path of a real task.

The wrapper layer treats MCP servers as a connection pool with a budget: pre-warm a baseline of sessions so cold starts do not land mid-task, cap concurrent sessions per server with a real queue, and scale on concurrent-session count rather than CPU. Standard autoscaling measures the wrong number.

## Gap 3 — governance: which servers, whose secrets, what audit trail

The first two gaps are about keeping MCP up; this one is about keeping it _safe_, and it surfaces in a security review rather than on a pager.

MCP is, by design, a way to give a model hands: a server exposes tools, tools have side effects, an agent that can reach a server can invoke them. The protocol standardizes _how_ the agent calls the tool, not _whether it should be allowed to_. Authorization, secret custody, and auditability sit above the protocol — and in most demo-derived deployments, nowhere at all. Three questions a production deployment must answer:

**Which servers may this agent call?** A research agent has no business holding a session to the payments server; a support agent has no business reaching the internal admin server. In a demo the question does not arise. In production you have many agents of differing trust levels and servers of differing blast radius, and "any agent can connect to any server" is not a security posture — it is the absence of one. You need an allowlist binding an agent's identity to the servers it may open sessions with. That is fundamentally a question about _who the agent is_ — the same identity problem the [ERC-8004 essay](/blog/erc-8004-agent-identity/) works through for agent-to-agent transactions, turned inward at your own fleet. An agent without a verifiable identity cannot be governed.

**Where do the secrets live?** MCP servers need credentials — database passwords, API keys, payment tokens. The lazy path hands those to the agent to pass through, which puts production credentials inside a model's context window, where they can be logged, leaked into a trace, or exfiltrated by a prompt injection riding in on tool output. Secrets belong to the server or a broker in front of it, never the agent: the agent presents its identity, the broker attaches the credential, the model never sees the key.

**What is the audit trail?** When an agent calls `database.execute` or `payments.transfer`, that event must be logged independently of agent and server — who called, which tool, what arguments, what result, under which session. MCP does not mandate a log. If the only record is the agent's own trace, you do not have an audit trail; you have the agent's word, and for anything regulated or with money attached, the word of the thing being audited is not evidence.

The answer to all three is one policy layer every MCP call passes through — it authenticates the agent, authorizes it against the requested server, attaches server credentials so the model never sees them, and appends the call to an independent log. That layer is not optional infrastructure you add when you have time. It is the difference between an agent platform and an unbounded remote-code-execution surface with a friendly protocol in front of it.

## Gap 4 — retry and expiry: when a session drops mid-task

The first three gaps are about steady state. This one is about the moment things go wrong — the most underspecified part of nearly every MCP deployment, because the demo never runs long enough to hit it.

Sessions drop: a replica gets recycled in a deploy, a network blip kills the connection, an idle session times out. This is the normal background failure rate of a distributed system, not an edge case — and MCP gives almost no guidance on what the agent does next; recovery semantics are largely left to the implementation.

Here is the trap. When a session drops mid-task and the agent reconnects and retries, it retries against a server that lost all session state — and does not know whether the failed call _actually failed_. Consider an agent calling `payments.transfer`: the call goes out, the session drops before the response comes back. Did the transfer happen? The agent cannot tell. Retry naively and it may transfer twice; give up and it may report failure on a transfer that succeeded. Both are wrong, and both are silent.

This is the same disease the [agent-budgets essay](/blog/agent-budgets/) diagnoses in a different organ: an agent that hits a wall and just tries again is not recovering, it is guessing. A dropped MCP session demands a recovery contract, not a retry loop:

- **Classify the failure.** A tool returning "invalid argument" failed deterministically — retrying is pointless. A dropped session failed _ambiguously_ — the outcome is unknown. The agent must branch on which it hit; most collapse both into a generic catch-and-retry.
- **Demand idempotency for side effects.** Every state-changing tool call must carry an idempotency key, so a post-drop retry is safe: the server recognizes the key and returns the original result instead of repeating the side effect. A state-changing tool without one is unsafe to call from an agent that can lose its session — which is every agent.
- **Reconstruct, then bound.** Recovery for a read is free: reconnect and re-run. Recovery for a write is a read — reconnect, _query_ whether it landed, act on the answer. Give reconnection a budget; when it is spent, emit a structured partial result naming which operations are confirmed and which are unknown. Unknown is a valid state; pretending it is success is not.

A session drop is a fork in the agent's control flow. An agent without that fork written in will, sooner or later, do a side effect twice.

## The wrapper layer you actually ship

The four gaps share one shape: MCP standardizes the _conversation_ with a tool, not the _operations_ around it — routing, scaling, authorization, recovery. Someone owns those, and that someone is a wrapper: a session-aware reverse proxy and policy broker between the agent fleet and the MCP servers, so the agent code stays clean plain MCP. Build or adopt it _before_ the first server goes behind a load balancer — then audit the deployment:

- [ ] MCP sessions routed with affinity, so a session always reaches the instance that holds it.
- [ ] Scaling and alerting on concurrent session count, not CPU or requests per second.
- [ ] Server fan-out counted — sessions per agent task — and the fleet sized for it.
- [ ] An identity-bound allowlist of which servers each agent may open a session with.
- [ ] Server secrets held by the server or a broker, never inside the agent's context window.
- [ ] Every tool call written to an audit log independent of both agent and server.
- [ ] Dropped sessions classified as ambiguous, distinct from a deterministic tool error.
- [ ] Every state-changing tool taking an idempotency key, so a post-drop retry cannot double-execute.

A deployment that cannot check all eight does not have an MCP problem. It has a demo that has not been caught yet.

## Reading list

- The [Model Context Protocol](https://modelcontextprotocol.io) specification, including session lifecycle and the transport layer. Read the lifecycle section twice; most deployments skim it.
- The [Model Context Protocol](https://modelcontextprotocol.io) project also publishes a public 2026 roadmap — it shows which of these gaps the maintainers intend to close in the protocol versus leave to implementations.
- [Anthropic](https://www.anthropic.com), which introduced MCP — the introductory write-up frames what the protocol is for, which makes the boundary of what it leaves out easier to see.

MCP won the integration problem. The operational problems were never in scope — so build the layer that owns them, before the load balancer finds out for you.