Voice agents have a hard latency target. Below 300ms end-to-end (user stops talking → first audible response), the conversation feels normal. Between 300ms and 600ms it feels stilted. Above 600ms it feels broken, and callers hang up. There is no soft middle. Either you’re under the threshold or you’re not.
The 280ms in the title is the p95 budget a well-built voice intake is architected to hold. Median around 198ms, p99 around 340ms. These are the numbers a correctly-built pipeline targets at real phone-tree traffic volumes — here is the architecture that gets you there.
This post traces every millisecond a packet spends in that pipeline. If you’re trying to build a voice agent and you don’t know where 280ms is hiding, this is the breakdown.
The stack at a glance
┌─────────┐   audio    ┌────────┐   text    ┌───────┐
│ caller  │ ─────────► │  STT   │ ────────► │  LLM  │
└─────────┘            └────────┘           └───────┘
     ▲                                          │
     │      audio      ┌────────┐     text      │
     └──────────────── │  TTS   │ ◄─────────────┘
                       └────────┘
Components:
- Transport: WebRTC via LiveKit. The phone-trunk leg uses a SIP-WebRTC bridge.
- VAD (voice-activity detection): server-side, Silero VAD running on the same node as the STT.
- STT (speech-to-text): Deepgram Nova-3 in streaming mode.
- LLM: Claude Haiku 4.5 for the fast path; Sonnet 4.6 for queries flagged complex by a classifier (under 6% of turns).
- TTS (text-to-speech): ElevenLabs Flash 2.5 with streaming output.
- RAG: pgvector retrieval running in-process with the LLM service. Retrieval is overlapped with first-token generation; see below.
Three nodes total in the hot path: a media gateway, an inference node, and a TTS node. All in the same AWS region, all on 25-Gbps networking, all with persistent gRPC connections kept warm.
The 280ms is the p95, not the average
I want to make this part loud, because everyone gets it wrong.
Voice latency is not a number. It is a distribution. The median is meaningless on its own. The p50 of this design is 198ms; the p95 is 280ms; the p99 is 340ms. A caller who experiences a single 1.2-second response in a 6-minute call will hang up. The whole system is rated by its tails.
A latency budget needs to be a per-percentile budget. A reasonable one, roughly:
| Percentile | Budget |
|---|---|
| p50 | 220 ms |
| p95 | 280 ms |
| p99 | 360 ms |
| p99.9 | 500 ms |
If you only optimize the average, you ship a system with a great median in which 8% of calls still feel broken. Those 8% will dominate your reviews, your churn, and your CSAT score.
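To make the per-percentile idea concrete, here is a minimal sketch of a budget check. The budget values come from the table above; the function names and the nearest-rank percentile method are our own illustrative choices, not part of any particular monitoring library.

```python
import math

# Per-percentile budget in milliseconds, taken from the table above.
BUDGET_MS = {50.0: 220, 95.0: 280, 99.0: 360, 99.9: 500}


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # round() guards against float noise such as 99.9 * 1000 / 100 -> 999.0000000001
    rank = max(1, math.ceil(round(pct * len(ordered) / 100, 9)))
    return ordered[rank - 1]


def budget_violations(samples, budget=BUDGET_MS):
    """Return {percentile: (observed_ms, budget_ms)} for every percentile over budget."""
    violations = {}
    for pct, limit in budget.items():
        observed = percentile(samples, pct)
        if observed > limit:
            violations[pct] = (observed, limit)
    return violations


# 95 fast turns and 5 very slow ones: the mean is 250ms, comfortably "under 280".
latencies = [200] * 95 + [1200] * 5
print(sum(latencies) / len(latencies))
print(budget_violations(latencies))
```

In this made-up sample the mean and the p95 both look healthy, while the p99 and p99.9 budgets are badly missed. An average-only dashboard hides exactly that situation.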
The packet’s life, in order
I will walk through what happens between the moment the caller stops talking and the moment they hear the first phoneme of the reply. I’ll budget every step.
Step 1 — Endpoint detection (VAD): 60–120ms
VAD does not detect “user stopped talking” the instant the audio goes silent. It cannot — natural speech has 100–200ms intra-word pauses, and a VAD that fires on every pause produces an interruption-prone, jittery system that cuts the user off mid-sentence.
The VAD therefore needs to wait some amount of silence before declaring an endpoint. Too short and you interrupt the user. Too long and you add direct latency.
We use Silero VAD with a silence_duration_ms of 110ms. That is on the aggressive side; the default in most stacks is 400–500ms. We can run it that low because we pair it with:
- A “speech activity probability” threshold at 0.45 (a little below the Silero default of 0.5), so we don’t enter the silence-counting state on legitimate pauses with breath/aspiration noise.
- A “minimum speech duration” of 200ms, so a single cough or “uh” doesn’t trigger a turn.
- A second-stage filter that listens for 60ms of audio after the VAD’s endpoint signal and cancels the turn if speech resumes.
The second-stage filter is the one that lets us be aggressive without cutting people off. The cost is 60ms of extra latency on every turn — but only after the VAD already fired. It’s overlap, not addition.
Net VAD cost in the hot path: 60–120ms. We treat 90ms as the budgeted value for p95 accounting.
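The two-stage logic is easier to see as a small state machine. The following is a sketch under stated assumptions, not our production code: it takes one speech probability per frame (as a VAD model would produce), assumes a 10ms frame for clean arithmetic, and uses a hypothetical `Endpointer` class with the thresholds described above.

```python
from dataclasses import dataclass


@dataclass
class Endpointer:
    """Two-stage end-of-turn detector fed one speech probability per audio frame."""
    frame_ms: int = 10        # assumed frame size; a real VAD has its own window
    threshold: float = 0.45   # speech-probability threshold
    silence_ms: int = 110     # silence needed for the first-stage endpoint
    min_speech_ms: int = 200  # ignore coughs and short "uh"s
    confirm_ms: int = 60      # second-stage listening window

    def __post_init__(self):
        self._reset()

    def _reset(self):
        self._speech = 0          # voiced ms accumulated in this turn
        self._silence = 0         # consecutive unvoiced ms
        self._confirming = False  # True between "tentative" and "confirmed"/"cancel"

    def feed(self, prob):
        """Consume one frame; return 'tentative', 'cancel', 'confirmed', or None."""
        voiced = prob >= self.threshold
        if self._confirming:
            if voiced:
                # Speech resumed inside the confirm window: undo the endpoint.
                self._confirming = False
                self._speech += self.frame_ms
                self._silence = 0
                return "cancel"
            self._silence += self.frame_ms
            if self._silence >= self.silence_ms + self.confirm_ms:
                self._reset()
                return "confirmed"
            return None
        if voiced:
            self._speech += self.frame_ms
            self._silence = 0
            return None
        if self._speech == 0:
            return None  # still waiting for the turn to start
        self._silence += self.frame_ms
        if self._silence >= self.silence_ms:
            if self._speech < self.min_speech_ms:
                self._reset()  # too short to count as a turn
                return None
            self._confirming = True
            return "tentative"  # downstream work can start now
        return None


ep = Endpointer()
frames = [0.9] * 30 + [0.1] * 17  # 300ms of speech, then 170ms of silence
print([event for event in map(ep.feed, frames) if event])
```

The "tentative" event is where downstream work (STT finalize, retrieval) can begin; "cancel" tells those consumers to discard it.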
Step 2 — STT finalization: 30–110ms
Deepgram’s streaming mode emits interim hypotheses every ~50ms while the user is talking. When the VAD signals endpoint, we ask Deepgram for the final hypothesis. Deepgram has been receiving audio the whole time; the finalize call returns the locked-in transcript with the language model applied.
Deepgram’s finalize call is fast — typically 30–60ms — if the trailing audio is unambiguous. If the last word is acoustically ambiguous (e.g., “fifteen” vs. “fifty”), the language model runs an extra rescoring pass and the latency climbs to 90–110ms.
Two things help.
First, send STT the post-VAD audio aggressively. The first 60ms of “post-end” audio that the VAD’s second-stage filter is examining anyway can be streamed to Deepgram too. Often Deepgram has already finalized by the time the VAD confirms end-of-turn — in which case the finalize call is nearly instant.
Second, configure Deepgram for the domain. We pass domain vocabulary (carrier names, claim codes, policy types) through keyterm prompting, which Nova-3 uses in place of the older keywords parameter, and Deepgram’s biasing reduces ambiguity-driven latency by roughly 20% in the traces we’ve measured.
Net STT cost: 30–110ms, budgeted as 60ms for p95.
Step 3 — Network hop, gateway → inference: 4–8ms
The media gateway and the inference node sit in the same AZ. Round-trip with gRPC streaming over a kept-alive HTTP/2 connection: 4–8ms.
This number is not free — it’s small because the design pays for it. If your gateway and your inference service live in different AZs (let alone different regions), you’ll pay 12–40ms on this hop alone. Poorly-placed deployments routinely sit at 80ms. Co-locate or pay the bill.
Net network cost: 5ms budgeted.
Step 4 — RAG retrieval (overlapped): 0ms apparent
This is where most teams miss free time.
The STT emits interim transcripts every ~50ms during the user’s speech. The interim transcript at the moment of endpoint is typically ~92% identical to the final transcript across production traffic. That is more than enough signal to start the retriever.
So we do. The instant the VAD’s first-stage signal fires, we kick off a retrieval against pgvector using the interim transcript as the query. That retrieval takes about 28ms on a reference index (~4M vectors, IVFFlat with HNSW reranking). It runs in parallel with: (a) the VAD’s second-stage 60ms filter, (b) the STT finalize call, (c) the network hop.
By the time we have the final transcript, retrieval has completed. We do a final 6ms reranking step on the actual final transcript in case it differs from interim. If it does and the top result has changed, we use the new one; if not, we use the cached top-K.
Net retrieval cost in the critical path: 0ms apparent (it’s overlapped). Wall-clock cost: 34ms, eaten by other components.
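The overlap can be sketched with asyncio. Everything here is a stand-in: `retrieve`, `finalize_transcript`, and `rerank` are hypothetical stubs that sleep for roughly the durations quoted above, where real code would call pgvector and the STT service.

```python
import asyncio


# Hypothetical stubs; the sleeps mimic the latencies quoted in the text.
async def retrieve(query):
    await asyncio.sleep(0.028)  # ~28ms vector search
    return [f"chunk for: {query}"]


async def finalize_transcript(interim):
    await asyncio.sleep(0.060)  # ~60ms STT finalize
    return interim              # pretend the final transcript matches the interim


async def rerank(query, chunks):
    await asyncio.sleep(0.006)  # ~6ms rerank against the final transcript
    return chunks


async def on_first_stage_endpoint(interim):
    """Start retrieval on the interim transcript while STT finalizes."""
    retrieval = asyncio.create_task(retrieve(interim))
    final = await finalize_transcript(interim)  # retrieval runs during this await
    was_ready = retrieval.done()                # True when retrieval was fully hidden
    chunks = await rerank(final, await retrieval)
    return final, chunks, was_ready


final, chunks, was_ready = asyncio.run(on_first_stage_endpoint("claim status for policy"))
print(final, chunks, was_ready)
```

The point of `was_ready` is diagnostic: if it starts coming back False in production, retrieval has crept onto the critical path.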
Step 5 — LLM first-token latency: 90–160ms
This is the biggest single line item and the one teams obsess over most.
The LLM metric you care about is first-token latency, also called time to first token (TTFT). It is the wall-clock time from when you POST the request to when the first token of the response arrives back. It is not the time per token; it is the time to start producing tokens at all.
For Haiku 4.5 on a streaming completion with a ~2000-token system prompt (cached), ~400 tokens of retrieved context (not cached), and a ~80-token user message:
- Cached prefix prefill (system prompt, conversation history): 30–50ms
- Uncached prefill (retrieved context + user message): 50–90ms
- First-token generation: 12–25ms
Total: 90–160ms, budgeted as 130ms for p95.
The number to watch like a hawk is the uncached portion. Every token that isn’t cached eats prefill time at roughly 1ms per 8 tokens. Four hundred uncached tokens is ~50ms; 1200 uncached tokens is ~150ms. RAG inserts uncached tokens by definition (different chunks per turn), so RAG turns are always slower than no-RAG turns.
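As a quick sanity check on that rule of thumb, here is the arithmetic as a small helper. The 8-tokens-per-millisecond constant is the rough figure from the paragraph above, not a provider guarantee, and the function name is ours.

```python
TOKENS_PER_MS = 8  # rule of thumb from the text: ~1ms of prefill per 8 uncached tokens


def uncached_prefill_ms(uncached_tokens):
    """Estimated prefill time for tokens that miss the prompt cache."""
    return uncached_tokens / TOKENS_PER_MS


for tokens in (400, 1200):
    print(tokens, "uncached tokens ->", uncached_prefill_ms(tokens), "ms")
```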
Three optimizations matter.
First, cache the system prompt. The Anthropic SDK supports prompt caching with a 5-minute TTL via the cache_control field on a content block:
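    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Mark the system prompt as a cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[...],  # conversation turns go here
    )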
After the first turn, the system prompt is a cache hit. Prefill for the cached portion drops to 30–50ms regardless of length. This pattern supports 8k-token system prompts at negligible prefill cost because every turn except the first is a cache read. The 5-minute TTL is generous for a phone call: it is refreshed on every cache hit, so the cache only expires if more than five minutes pass between turns.
Second, keep retrieved context short. A typical config retrieves 5 chunks at 256 tokens each = 1280 tokens. That’s a 160ms prefill. You could retrieve 3 chunks at 256 tokens = 768 tokens (100ms prefill, 60ms saved). Is the precision drop worth 60ms? For a high-stakes claims workflow, the answer is no — keep 5 chunks. For lower-stakes lookup-style traffic, it might be yes. You won’t know without measuring.
Third, stream the response and start TTS on the first sentence. More on this in step 6.
Net LLM cost in the critical path: first-token latency, ~130ms.
Step 6 — TTS first-byte latency: 80–160ms
TTS does not need the full LLM response to begin. ElevenLabs Flash 2.5 (and most streaming TTS providers) accept a stream of text tokens and emit audio bytes as soon as they can synthesize a complete prosodic unit — typically the first phrase or sentence.
In our pipeline, the LLM streams text out, and the moment we see a sentence-ending punctuation mark (., ?, !) or a comma followed by ~20 characters, we cut a synthesis request to ElevenLabs. The first audible byte comes back in 80–160ms depending on the length of the first phrase.
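A minimal sketch of that cut rule follows. It is one interpretation, labeled as such: we read "a comma followed by ~20 characters" as "cut at the comma once 20 more characters have arrived", and the names `speakable_chunks` and `MIN_AFTER_COMMA` are illustrative. A real implementation would also need to handle decimals and abbreviations, which this one treats as sentence ends.

```python
SENTENCE_END = ".?!"
MIN_AFTER_COMMA = 20  # characters that must follow a comma before we cut there


def _cut_point(buf):
    """Index just past the first place we may cut, or None if we should wait."""
    for i, ch in enumerate(buf):
        if ch in SENTENCE_END:
            return i + 1
        if ch == "," and len(buf) - (i + 1) >= MIN_AFTER_COMMA:
            return i + 1
    return None


def speakable_chunks(tokens):
    """Yield text chunks for TTS as soon as a streamed LLM response allows a cut."""
    buf = ""
    for tok in tokens:
        buf += tok
        cut = _cut_point(buf)
        while cut is not None:
            chunk, buf = buf[:cut].strip(), buf[cut:]
            if chunk:
                yield chunk
            cut = _cut_point(buf)
    if buf.strip():
        yield buf.strip()  # flush whatever is left when the stream ends


stream = ["Sure", ",", " I", " can", " help", " with", " that", "."]
print(list(speakable_chunks(stream)))
```

Each yielded chunk would become one synthesis request, so the first request goes out before the LLM has finished its answer.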
This is the only component for which we have not been able to push the median below 100ms. ElevenLabs Flash 2.5 has a hard floor at roughly 70ms even on optimal inputs.
If we waited for the full LLM response before starting TTS, we’d be adding the LLM’s total generation time (200–600ms for a 30–80 token answer) to the budget. Streaming TTS off a streaming LLM is non-negotiable.
Net TTS cost: 130ms budgeted as the time from “first text token from LLM” to “first audio byte from TTS.”
The tricky part: TTS doesn’t start until the LLM has produced enough text. So the TTS’s first-byte time is measured from the LLM’s first-token time, not from the original audio. The total end-to-end is:
    end_to_end = VAD + STT_finalize + network + LLM_first_token + TTS_first_byte
               = 90 + 60 + 5 + 130 + 130 (roughly)
               = ~415ms
That is way over our 280ms budget. Something is wrong with this accounting.
What’s wrong: the components overlap
The components I just enumerated do not run end-to-end in series. They overlap. The actual critical path is shorter than the sum because:
- STT finalize is overlapped with the VAD’s second-stage filter.
- Retrieval is overlapped with STT finalize and the network hop.
- TTS first-byte is overlapped with LLM token streaming.
The actual critical path is:
VAD ─────►─┐
           │ STT_final ─►─┐
           │ retrieval ──►┤
           │              └─► LLM (cached prefill ─► first token ─► early phrase)
           │                                             │
           │                                             ▼
           └──────────────────────────────────────────► TTS (streamed) first byte
The bottleneck depends on which component runs longest from “VAD end-of-speech” to “TTS first audible byte.”
In practice:
- VAD’s second-stage filter (60ms) overlaps with STT finalize (60ms). They start at the same instant.
- Retrieval (28ms) starts at VAD’s first-stage signal, so it’s done before STT finalize.
- LLM prefill begins as soon as the final transcript is ready (≈ max(VAD_second_stage, STT_final) = 60ms).
- LLM streams first token at +130ms after that. (System prompt cached.)
- TTS emits first audio byte at +130ms after first LLM token.
End-to-end critical path: max(60, 60) + 130 + 130 = 320ms.
That’s still over 280. Where does the rest come from?
Where the saved 40ms comes from
Two further overlaps.
First, the TTS first-byte does not have to wait for the LLM’s first token. It has to wait for the LLM’s first speakable phrase. The LLM streams tokens at ~12ms each. A 6-token phrase like “I can help with that.” takes 72ms to fully emit. But the TTS engine can start synthesizing the moment the first few tokens form a phoneme-stable prefix.
In practice, the TTS first-byte clock starts at the LLM’s second or third streamed token, not the first. That gives us 12–25ms of overlap.
Second, the VAD second-stage filter does not add its full 60ms in expectation. The filter cancels the turn if speech resumes; if speech does not resume (the common case), downstream work has already been running during those 60ms. We’re not paying the 60ms in series; we’re paying it as a max() against work that’s already underway.
Both overlaps shave maybe 30–40ms off the critical path. The p95 lands at 280ms.
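The three accountings can be put side by side in a few lines. The values are the budgeted p95 figures used throughout this post; the variable names are ours. Note that the overlapped figure follows the text’s accounting, which starts the clock at the VAD’s first-stage signal and leaves out the 5ms network hop.

```python
# Budgeted p95 values in milliseconds, as used in the text.
VAD, STT_FINAL, NETWORK, LLM_TTFT, TTS_FIRST_BYTE = 90, 60, 5, 130, 130
VAD_SECOND_STAGE = 60

# Naive accounting: every component runs in series.
serial_ms = VAD + STT_FINAL + NETWORK + LLM_TTFT + TTS_FIRST_BYTE

# Overlapped accounting: second-stage VAD and STT finalize run side by side.
overlapped_ms = max(VAD_SECOND_STAGE, STT_FINAL) + LLM_TTFT + TTS_FIRST_BYTE

# The two further overlaps are estimated at 30-40ms combined.
with_overlaps_low_ms = overlapped_ms - 40
with_overlaps_high_ms = overlapped_ms - 30

print(serial_ms, overlapped_ms, with_overlaps_low_ms, with_overlaps_high_ms)
```

The output is 415, 320, 280, and 290: the budget holds only at the optimistic end of the overlap estimate, which is why the headroom discussed next matters.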
How to keep this stable
Latency budgets aren’t a one-time exercise. They drift. Three drift sources to watch for:
- Model upgrades. A Haiku version bump can shave 40ms off p95 — or add 90ms if the provider pushes an upstream change you didn’t notice. Alert on p95 latency, not just on errors.
- Retrieval bloat. Every quarter the corpus grows. The retriever doesn’t get slower per-query, but the embeddings cache evicts more aggressively, and the reranking call sees more candidates. p95 retrieval can drift from 28 to 41ms over six months before anyone notices. Re-tune the IVFFlat probes.
- TTS voice changes. A new voice profile (different speaker model) can add 22ms to TTS first-byte because of slower warm-up. The latency may still be under budget — which is exactly why you’ll miss it. Headroom is the asset; protect it.
We instrument every component and dashboard the per-component p95 separately. When end-to-end drifts, we can tell which component is responsible within minutes.
The five numbers to instrument on every turn
Five numbers. Per turn. Streamed to OpenTelemetry, displayed on a single Grafana board.
- VAD endpoint duration
- STT finalize duration
- LLM first-token duration
- TTS first-byte duration
- End-to-end (VAD endpoint → TTS first byte)
Per-call rollups:
- p50, p95, p99 of end-to-end
- Tail count: number of turns over 500ms
- Tail count per cause: which component was the bottleneck
The tail-by-cause histogram is the single most useful artifact. When 4% of turns are slow, you don’t fix the average; you fix the cause that dominates the tail. Sometimes that’s the LLM; sometimes it’s the TTS; sometimes it’s a retrieval cache miss; sometimes it’s a stale gRPC connection that needed to be reaped. They all look the same from the outside.
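Here is a minimal sketch of the per-turn record and the tail-by-cause rollup, using only the standard library. The `TurnTiming` class and its field names are hypothetical, and attributing a slow turn to whichever component took longest is a deliberately crude assumption; comparing each component against its own budget would be a reasonable refinement.

```python
from collections import Counter
from dataclasses import dataclass

TAIL_THRESHOLD_MS = 500  # turns slower than this count as tail


@dataclass
class TurnTiming:
    vad_ms: float
    stt_ms: float
    llm_ms: float
    tts_ms: float
    end_to_end_ms: float

    def bottleneck(self):
        """Name of the component with the largest duration on this turn."""
        parts = {"vad": self.vad_ms, "stt": self.stt_ms,
                 "llm": self.llm_ms, "tts": self.tts_ms}
        return max(parts, key=parts.get)


def tail_by_cause(turns, threshold_ms=TAIL_THRESHOLD_MS):
    """Count slow turns, grouped by which component dominated each one."""
    return Counter(t.bottleneck() for t in turns if t.end_to_end_ms > threshold_ms)


turns = [
    TurnTiming(90, 60, 130, 130, 280),  # healthy turn, not counted
    TurnTiming(90, 60, 420, 130, 610),  # slow LLM first token
    TurnTiming(90, 55, 120, 390, 560),  # slow TTS first byte
    TurnTiming(95, 60, 450, 120, 640),  # slow LLM first token
]
print(tail_by_cause(turns))
```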
Things we tried that didn’t work
A short list of optimizations that sound smart and aren’t.
Pre-fetching the most likely LLM response. The idea: while the user is still talking, run a low-token completion based on the interim transcript so the LLM is “warmed up.” This sounds great. In practice, the warming is wasted on most turns: a prefix cache only helps when the prompt matches exactly, and the final transcript usually differs from the interim one by at least a word or two. The cost (an extra LLM call per turn) isn’t repaid by the latency saved.
Switching to a smaller TTS model. ElevenLabs Flash 2.5 is already the fast tier. We tried a self-hosted alternative; the first-byte was 30ms faster but the voice quality was bad enough to be a non-starter for customer-facing calls. Nobody on the call cares about your 30ms if the agent sounds robotic.
Skipping the rerank step. Saves 6ms. Retrieval precision dropped enough to materially hurt the model’s answer quality, and the LLM started “thinking out loud” more (longer answers, more hedging), which cost us more in TTS time than the rerank saved.
Caching common queries. Voice queries are too long-tail. In test traffic the hit rate was about 3%. Maintenance burden was not worth it.
A recipe, not a target
The 280ms is not a number to chase blindly. Different voice products tolerate different latencies. A voice IVR for an emergency-services dispatcher needs sub-200ms p95. A voice chatbot for a smart speaker can tolerate 500ms because the user is already used to it. A voice agent in a kiosk can tolerate 700ms because the visual feedback masks it.
But every voice product has some threshold below which the conversation feels natural and above which it doesn’t. Find your number. Then break it down by component. Then measure every component on every turn. Then fix the tail, not the median.
When teams ask us to “make the voice agent faster,” what we do is exactly the above. There is no magic. There’s a stopwatch, a flame graph, and a list of components, and you go one component at a time.