FIELD NOTES · OPINION · 2026.01.30 · 7 min

Most agent demos are lying about the latency. Here is the math.

A 4-second agent looks great on stage and falls over in production. The demo has a few tricks. Once you see them, the latency claims of every other framework get a lot less impressive.

A founder showed me a demo last week. The agent retrieved three documents, drafted an email, and asked for approval. End-to-end clock: 1.8 seconds. The room was impressed. I was impressed.

I asked them to run it again on a different query. 6.4 seconds. They blamed the network. We tried a third query. 4.1 seconds. We tried a fourth, and the demo crashed because the tool call exceeded a timeout that had been raised to 60s specifically for the demo.

This is not a knock on the team. They are real engineers, the agent is well-built, and the framework underneath is the same one a hundred other people are using. What’s going on is structural. Agent demos have a latency presentation problem, and it is doing real damage when buyers extrapolate from demo time to production time.

This is a list of tricks. Once you’ve seen them you cannot unsee them.

Trick 1: the first call after a cold start is hidden

The demo opens with the agent already loaded. The model has been warmed up; the system prompt is cached; the tool list has been registered; the embedding index is in memory. None of that is visible.

When the same agent starts cold in production (every time an inactive customer comes back, every time a worker scales up, every time the cache TTL expires), you pay 800ms to 1.5s just to get started. That cost is real, it is recurring, and it doesn’t show up in the demo because the demo never starts cold.

The math: a 5-minute cache TTL on a typical Anthropic prompt cache plus a ~3-minute idle scale-down means a low-volume agent eats a cold start on roughly 30–60% of conversations. The demo’s 1.8s is the warm number. Take 50% as the cold rate and about 3.0s for a cold run (1.8s plus roughly 1.2s of startup), and the honest number is 1.8s × 0.5 (warm) + 3.0s × 0.5 (cold) = 2.4s average. The tail looks worse.
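The same arithmetic as a small Python sketch, so you can plug in your own cold rate. The function name is mine, and the 1.8s and 3.0s figures are the illustrative numbers from the paragraph above, not measurements.

```python
def blended_latency(warm_s: float, cold_s: float, cold_rate: float) -> float:
    """Expected latency when a fraction `cold_rate` of conversations start cold."""
    if not 0.0 <= cold_rate <= 1.0:
        raise ValueError("cold_rate must be between 0 and 1")
    return warm_s * (1.0 - cold_rate) + cold_s * cold_rate


# Illustrative figures from the text: 1.8s warm, about 3.0s cold,
# and a cold rate somewhere between 30% and 60%.
for rate in (0.3, 0.5, 0.6):
    avg = blended_latency(1.8, 3.0, rate)
    print(f"cold rate {rate:.0%}: {avg:.2f}s average")
```

This is only the mean. It says nothing about the tail, which is where cold starts hurt most.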

Trick 2: the tool calls are mocked

You will not be told this. You will be shown an agent that looks up a customer, finds the latest order, drafts a refund email, and asks for approval. The whole flow takes 2 seconds.

In production, “look up a customer” is a call to a CRM API that takes 220–800ms. “Find the latest order” is a query against an ERP that takes 80–500ms, with a long tail for paginated results. “Draft a refund email” is an LLM call against a 4k-token system prompt and the customer’s full transaction history, easily 1200–2200ms.

The demo has all of these mocked or pre-cached. The real numbers will add 2–4 seconds.

How to spot it: ask to see the network tab. If the demo doesn’t have one, ask for the request log on the agent’s backend. If the time spent on tool calls is under 100ms total, the tools are not real.
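If you do get a request log, the check is a few lines. This is a rough sketch that assumes you can export each tool call with a duration in milliseconds. The `ToolCall` shape and the tool names are hypothetical, and the 100ms threshold is the rule of thumb above, not a standard.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    duration_ms: float


def looks_mocked(calls: list[ToolCall], threshold_ms: float = 100.0) -> bool:
    """True when every tool call combined took less than the threshold."""
    total = sum(call.duration_ms for call in calls)
    return bool(calls) and total < threshold_ms


# Hypothetical demo log: suspiciously fast across the board.
demo_log = [ToolCall("crm_lookup", 4), ToolCall("erp_latest_order", 6), ToolCall("draft_email", 35)]

# Low-end production figures quoted earlier: 220ms, 80ms, 1200ms.
prod_log = [ToolCall("crm_lookup", 220), ToolCall("erp_latest_order", 80), ToolCall("draft_email", 1200)]

print("demo looks mocked:", looks_mocked(demo_log))
print("production looks mocked:", looks_mocked(prod_log))
```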

Trick 3: streaming text masks slow answers

The single most effective demo trick. The agent starts printing text after 300ms. The text is the model “thinking” out loud, or describing what it’s about to do. By the time it actually does anything useful, you’re already 4–6 seconds in, but the screen has been printing text continuously the whole time.

This is fine for a chatbot. The streaming UX is genuinely better than waiting silently. But it is dishonest as a latency claim. The metric that matters is time to useful output, not time to first token. If the first 1200 tokens of the response are “Let me think about that. First, I’ll need to look up the customer’s recent order history…” then those tokens are not output; they are filler.

The fix is simple: measure time-to-action, not time-to-first-token. Action = a tool call, a structured output, a decision. Filler doesn’t count.
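Here is one way that measurement might look, assuming your agent emits a timestamped event for every streamed token and every action. The event kinds and the trace are made up for illustration; map them onto whatever your framework actually logs.

```python
from typing import Iterable, Optional, Tuple

# Event kinds that count as useful output, per the definition above.
ACTION_KINDS = frozenset({"tool_call", "structured_output", "decision"})

Event = Tuple[float, str]  # (seconds since the request was sent, event kind)


def first_seen(events: Iterable[Event], kinds: frozenset) -> Optional[float]:
    """Return the timestamp of the earliest event whose kind is in `kinds`."""
    times = [t for t, kind in events if kind in kinds]
    return min(times) if times else None


# Hypothetical trace: text starts streaming at 0.3s, the first tool call lands at 4.2s.
trace = [(0.3, "token"), (0.9, "token"), (2.5, "token"), (4.2, "tool_call"), (5.0, "token")]

ttft = first_seen(trace, frozenset({"token"}))
tta = first_seen(trace, ACTION_KINDS)
print(f"time to first token: {ttft}s, time to action: {tta}s")
```

The gap between the two numbers is the filler.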

Trick 4: the “thinking…” spinner

A variation on trick 3, but for agents that don’t stream. The UI shows a thinking-dots spinner with a vague message (“Searching your inbox…”). The user watches the spinner and feels that productive things are happening. Internally, the agent is doing nothing useful for 1.4 seconds because the planner is rambling.

If the spinner is up for 600ms, your real latency is 600ms more than your “demo latency” claim. Always.

Trick 5: the demo query has been seen before

Watch carefully when the demo starts. Did the presenter type the query, or did they paste it? If they pasted it, was it a real query they would have typed live, or was it a phrase that happens to retrieve the exact three documents they want to show?

Many agents demo well because the queries have been hand-tuned to flatter the retriever. Slight rewordings break it. We saw an agent that handled “what’s our refund policy” beautifully and choked on “do we give refunds.” Same intent. Different chunks retrieved. Different reasoning path. Different latency.

A demo that handles only the rehearsed queries is not a demo. It is a play.

Trick 6: one happy path, never the failure mode

What happens when the tool call returns an error? The demo never shows you. Probably the agent retries the same call. Probably it retries again. Probably it eventually emits an apology paragraph and asks the user to try again.

That retry loop is silent in the demo because the demo’s queries never fail. In production, 4–12% of tool calls fail for transient reasons. Each retry adds 800–2200ms. An agent that demos at 2 seconds will show you a tail of 8+ seconds once you put it in front of real traffic.

Ask: what does this agent do on a 429 from the CRM? On a timeout from the embedding service? On a 401 from an expired auth token? If the answer is “it retries” or “it logs and continues,” the production latency distribution is much worse than the demo suggests.
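A back-of-the-envelope sketch of where the 8+ second tail comes from. It assumes tool calls fail independently, a three-call flow, and a cap of three retries on one call. All three are my assumptions, and real failures tend to be correlated, which makes things worse.

```python
def p_any_retry(fail_rate: float, n_calls: int) -> float:
    """Probability that at least one of `n_calls` independent tool calls fails."""
    return 1.0 - (1.0 - fail_rate) ** n_calls


def worst_case_s(base_s: float, retry_cost_s: float, max_retries: int) -> float:
    """Latency when one call in the flow burns through every allowed retry."""
    return base_s + retry_cost_s * max_retries


# Failure rates from the text (4% and 12%), applied to an assumed 3-call flow.
for rate in (0.04, 0.12):
    share = p_any_retry(rate, 3)
    print(f"fail rate {rate:.0%}: {share:.0%} of flows hit at least one retry")

# 2s demo flow, 2.2s per retry (top of the quoted range), 3 retries assumed.
print(f"worst case: {worst_case_s(2.0, 2.2, 3):.1f}s")
```

Under these assumptions, roughly 12% to 32% of flows pay for at least one retry, and the unlucky ones land at 8.6 seconds.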

Trick 7: parallel calls are presented as if they cost the same as serial

This is one I see almost every time. The agent says it’s going to “look up the customer, fetch their order history, and check the warranty database in parallel.” The demo flashes three little spinners side by side, and they all complete in 600ms. Impressive.

In production, parallel calls cost the maximum of their individual latencies, plus the overhead of fanning out, plus any rate-limit waiting. If one of the three calls hits a slow path, the whole parallel batch is bottlenecked by that one call. The 600ms in the demo is the median of the fastest tool. The real number is the p95 of the slowest tool.

Take a system where the agent does three “parallel” lookups against three different vendor APIs. Demo claim: 700ms. Production p95: 2.8s. The slowest of the three APIs has a long tail driven by a backend cache miss, and it dominates every parallel batch.
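A toy simulation makes the point. The latency distributions below are invented for illustration (two fast vendors, one slow vendor whose cache misses 10% of the time), so read the shape of the result, not the exact numbers.

```python
import random


def batch_latency(latencies_ms: list[float], fanout_overhead_ms: float = 0.0) -> float:
    """A parallel batch finishes when its slowest call does."""
    return max(latencies_ms) + fanout_overhead_ms


def simulate(n: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """Return (median, p95) batch latency in ms over `n` simulated batches."""
    rng = random.Random(seed)
    batches = []
    for _ in range(n):
        fast_a = rng.uniform(150, 300)
        fast_b = rng.uniform(200, 400)
        # Hypothetical slow vendor: 500-700ms normally, 2-3s on a cache miss.
        cache_miss = rng.random() < 0.10
        slow = rng.uniform(2000, 3000) if cache_miss else rng.uniform(500, 700)
        batches.append(batch_latency([fast_a, fast_b, slow]))
    batches.sort()
    median = batches[len(batches) // 2]
    p95 = batches[(95 * len(batches)) // 100]
    return median, p95


median_ms, p95_ms = simulate()
print(f"median batch: {median_ms:.0f}ms, p95 batch: {p95_ms:.0f}ms")
```

With these made-up distributions the median sits in the slow vendor's normal range, while the p95 lands squarely in its cache-miss range. The two fast vendors never matter.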

Trick 8: response time excludes the user

The end-to-end timer the demo shows usually starts when the agent receives the request. It does not include the time the user took to type, the time the UI took to render the input, the round-trip from the user’s browser to the demo server, or the latency of the TLS handshake on a fresh connection.

In production, the user-perceived latency includes all of this. If the user is on a mobile network with a 200ms RTT, every round-trip is +200ms. If the page hasn’t pre-warmed a connection, the first request pays 100ms of TCP+TLS. The demo claims 1.8s; the user feels 2.5s.

Doing the math honestly

If you want to estimate the production latency of a demo, take the demo number and apply this stack:

production_latency ≈
    demo_time
  + 500ms × P(cold_start)
  + sum(real_tool_latencies) - sum(mocked_tool_latencies)
  + 200ms × P(first_TLS)
  + 800ms × P(retry_on_first_call)
  + 1500ms × P(query_outside_rehearsed_set)

For most demos I’ve seen, P(cold_start) is at least 0.3, P(retry) is at least 0.1, P(query_outside_rehearsed_set) is at least 0.4. The arithmetic comes out to: real latency is 1.8–3.5x the demo number.
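The stack above, written out as a Python function so you can run it against a vendor’s demo number. The coefficients are copied from the formula. The mocked tool timings and the 0.5 for P(first_TLS) are hypothetical inputs I picked for the example.

```python
def estimate_production_ms(
    demo_ms: float,
    p_cold_start: float,
    real_tool_ms: list[float],
    mocked_tool_ms: list[float],
    p_first_tls: float,
    p_retry: float,
    p_unrehearsed: float,
) -> float:
    """Apply the adjustment stack from the text to a demo latency."""
    return (
        demo_ms
        + 500 * p_cold_start
        + (sum(real_tool_ms) - sum(mocked_tool_ms))
        + 200 * p_first_tls
        + 800 * p_retry
        + 1500 * p_unrehearsed
    )


estimate = estimate_production_ms(
    demo_ms=1800,
    p_cold_start=0.3,
    real_tool_ms=[220, 80, 1200],  # low end of the Trick 2 ranges
    mocked_tool_ms=[10, 10, 30],   # hypothetical mocked timings
    p_first_tls=0.5,               # assumption: half of requests open a fresh connection
    p_retry=0.1,
    p_unrehearsed=0.4,
)
print(f"estimate: {estimate:.0f}ms, {estimate / 1800:.1f}x the demo number")
```

With these inputs the estimate comes out at about 4.2s, roughly 2.3x the 1.8s demo. Use the high end of the tool ranges and the multiplier climbs quickly.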

This is rarely deliberate dishonesty. It’s that the demo’s incentive structure (impress, in 60 seconds, on the happy path) and production’s incentive structure (work, for everyone, on every path) are not the same.

What an honest demo looks like

An honest pitch shows two numbers. The “warm, happy path” number, and the “cold, real-world p95” number. The first is for vibe; the second is for negotiation.

It is not a coincidence that the cold-real-world number is the one we’d sign an SLA on. There is no clause in any reasonable contract about happy-path warm latency. The clauses that matter are p95 and p99 over a rolling 7-day window of production traffic.
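If you have raw latency samples, those two numbers are cheap to compute. This is a minimal sketch that assumes samples arrive as (timestamp, latency) pairs and uses the nearest-rank percentile definition. Your monitoring stack may interpolate differently, so agree on the method before you put it in a contract.

```python
import math
from datetime import datetime, timedelta


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rolling_tail(
    samples: list[tuple[datetime, float]],
    now: datetime,
    window: timedelta = timedelta(days=7),
) -> tuple[float, float]:
    """Return (p95, p99) latency over samples inside the rolling window."""
    recent = [latency for ts, latency in samples if now - window <= ts <= now]
    return percentile(recent, 95), percentile(recent, 99)


# Hypothetical data: 100 samples from the last few days, one stale outlier from last month.
now = datetime(2026, 1, 30, 12, 0)
samples = [(now - timedelta(hours=i), float(i + 1)) for i in range(100)]
samples.append((now - timedelta(days=30), 9999.0))

p95, p99 = rolling_tail(samples, now)
print(f"p95: {p95}, p99: {p99}")
```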

If a vendor cannot quote you those numbers, the demo’s latency is a stage trick. Watch the demo. Ask for the second number. If they don’t have it, you’re not buying an agent — you’re buying a play.
