# Graceful failure for voice agents

A caller is three minutes into a call with a voice agent. They have just finished a long sentence. The agent's text-to-speech provider, mid-reply, returns a 500. Or the language model, which has been answering in 400ms, stalls — and eight seconds pass with no token. From the caller's side, none of that detail is visible. What is visible is silence. A long, total, unexplained silence on a phone line, with a human being holding the phone, waiting.

That silence is the whole problem. In a web app, a backend that takes eight seconds shows a spinner; the user sees the system working and waits. A voice call has no spinner — no loading state, no skeleton screen, no progress bar. It has audio or it has nothing, and "nothing" on a live call does not read as "working." It reads as _broken_ — the line dropped, the agent crashed, this product does not work — and the caller hangs up, or starts saying "hello? hello?" into the void.

So a voice agent cannot treat failure the way a web service does. A web service can fail a request and return a 503; the user retries. A voice agent mid-call cannot return a 503 to a human ear. When a component fails or slows, the call is still live, the caller is still on the line, and the system's only acceptable move is to **degrade, not drop** — to keep the conversation alive and coherent while it works around the broken part.

This post is about engineering that: the per-component failure modes a voice stack actually has, the degradation strategies that cover them, the fallback ladder that ties them together, and the fault-injection testing that proves the ladder works before a real caller finds the hole. Two scope notes: this is not the architecture decision — cascade versus end-to-end speech-to-speech is [the trade study](/blog/cascaded-vs-end-to-end-speech/) — and it is not turn-taking, which is [its own discipline](/blog/voice-turn-taking/). This is what happens when something in the stack breaks while a human is listening.

## Why voice is unforgiving

Three properties of a voice call, stacked, are what make failure handling a first-class problem rather than an afterthought.

**It is real-time and synchronous.** There is no queue, no "we'll email you when it's ready." The caller is present, now, and every second of the failure is experienced live.

**There is no visual channel.** Every other kind of software can acknowledge a delay visually — a spinner, a "thinking…" indicator, a disabled button. Voice has one channel, and it is the same channel the conversation uses. The agent cannot show that it is busy; it can only speak or be silent, and silence is ambiguous in the worst way.

**Dead air is catastrophic, not merely bad.** A two-second pause in a chat is nothing. A two-second pause on a phone call is long enough for a caller to think the line dropped. Four seconds and they are saying "hello?" Eight and they have hung up and are dialing again, angry. The cost of a failure on a voice call is not proportional to its length in some gentle way — it is a cliff, and the system goes over it fast.

Put together: a voice agent has the tightest failure-handling constraints of almost any consumer-facing system, and the smallest toolkit to meet them. Everything below follows from that.

## The per-component failure modes

You cannot design a fallback for a failure you have not enumerated. A voice stack — take the cascade, STT then LLM then TTS, plus transport — has four places to fail, and each fails two ways: **down** (errors, unreachable) or **slow** (up, responding, but blowing its latency budget). Down and slow need different responses, and slow is the one teams forget, because slow does not throw an exception.

| Component           | Down — fails outright                              | Slow — blows its latency budget                             |
| ------------------- | -------------------------------------------------- | ----------------------------------------------------------- |
| STT                 | No transcript; the agent is deaf to the caller     | Transcript arrives late; the whole turn shifts past budget  |
| LLM                 | No response text; nothing for TTS to speak         | Token stall — the dead-air case; 8s of silence              |
| TTS                 | No audio; the agent is mute despite having a reply | First audio byte arrives late; reply starts noticeably slow |
| Network / transport | Call drops, or media path is severed               | Jitter, packet loss — choppy or garbled audio both ways     |

Two things to read off the table. First, a failure anywhere becomes the same symptom to the caller — dead air or a dropped call — so the system cannot rely on the caller telling it what broke; it has to detect the component-level fault itself. Second, the "slow" column is the dangerous one. A component that is _down_ fails loudly and fast — you get an error, you react in milliseconds. A component that is _slow_ gives you nothing to catch; it quietly spends the caller's patience. The LLM token stall is the canonical case: no error, no exception, just eight seconds of a model not producing a token while a human listens to nothing. Detecting slow means timeouts and budgets, not try/catch.
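To make "timeouts, not try/catch" concrete, here is a minimal Python sketch that turns a token stall into a catchable event. It assumes the LLM client exposes its tokens as an async iterator; `StageTimeout`, `first_token_within`, and `stalled_llm` are illustrative names, not any vendor's API.

```python
import asyncio


class StageTimeout(Exception):
    """Raised when a stage produces nothing before its latency ceiling."""

    def __init__(self, stage: str, ceiling_s: float):
        super().__init__(f"{stage} exceeded its {ceiling_s}s ceiling")
        self.stage = stage
        self.ceiling_s = ceiling_s


async def first_token_within(stream, stage: str, ceiling_s: float):
    """Return the first item of an async iterator, or raise StageTimeout."""
    try:
        return await asyncio.wait_for(stream.__anext__(), timeout=ceiling_s)
    except asyncio.TimeoutError:
        # The stall itself raised nothing; the timeout is what makes it visible.
        raise StageTimeout(stage, ceiling_s) from None


async def stalled_llm():
    # Hypothetical stand-in for a stalled model: no error, just no token.
    await asyncio.sleep(8)
    yield "hello"
```

Calling `first_token_within(stalled_llm(), "llm", 0.8)` raises `StageTimeout` after 0.8 seconds instead of letting eight seconds of silence pass, which gives the rest of the system something to react to.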

## Degradation strategies

Here is the toolkit. None of these is exotic; the discipline is having them all wired before a caller needs them, and knowing which covers which failure.

**Secondary providers, pre-warmed.** For STT and TTS especially, run a second vendor and fail over to it. The mechanics matter: the fallback connection should be warm — credentials valid, connection open or fast to open — because a failover that takes two seconds to cold-start has spent the entire dead-air budget on the failover itself. Accept that the backup sounds or transcribes slightly differently; a different voice mid-call is a small, survivable artifact, and the alternative is a dropped call. The trigger is typically a circuit breaker — after a handful of consecutive failures, the breaker opens and new traffic routes to the secondary, with a cooldown before it probes the primary again.
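A rough sketch of the breaker described above, in Python. The class name, the threshold of three consecutive failures, and the 30-second cooldown are assumptions chosen for illustration; a production breaker would also need to be safe under concurrent calls.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; probes the primary after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so tests do not have to sleep
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Opening (or re-opening after a failed probe) restarts the cooldown.
            self.opened_at = self.clock()

    def use_primary(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let a probe through to the primary.
        return self.clock() - self.opened_at >= self.cooldown_s
```

The caller checks `use_primary()` before each request, routes to the secondary when it returns `False`, and reports the outcome with `record_success()` or `record_failure()`.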

**Filler and holding speech.** This is the strategy unique to voice, and the most important one for the _slow_ failures. When the LLM is taking longer than its budget, the agent does what a human does when they need a moment: it says something. "Let me check on that for you." "One second while I pull that up." A short, natural holding phrase converts dead air — which reads as broken — into a pause that reads as the agent working, and buys real wall-clock time for the slow component to finish under cover of speech. Two cautions. It must be triggered by a latency threshold, not fired on every turn — an agent that says "let me check" before every answer is its own kind of broken. And it must not be overused: there is published evidence of voice agents with filler rates high enough to be effectively talking over the caller, which trades one failure for another. Filler is a thin cover for a real delay, not a verbal tic.
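Both cautions can be expressed in a few lines. The sketch below is one possible shape, assuming `generate` and `speak` are async callables supplied by the surrounding stack; `FillerPolicy`, the 0.8-second threshold, and the three-turn spacing are hypothetical values, not recommendations from any source.

```python
import asyncio


class FillerPolicy:
    """Allow a holding phrase only on slow turns, and not on consecutive ones."""

    def __init__(self, threshold_s=0.8, min_turns_between=3,
                 phrase="One second while I pull that up."):
        self.threshold_s = threshold_s
        self.min_turns_between = min_turns_between
        self.phrase = phrase
        self.turns_since_filler = min_turns_between  # first slow turn may fill

    def allow_filler(self) -> bool:
        return self.turns_since_filler >= self.min_turns_between

    def end_turn(self, used_filler: bool):
        self.turns_since_filler = 0 if used_filler else self.turns_since_filler + 1


async def speak_turn(generate, speak, policy: FillerPolicy) -> bool:
    """Speak the reply; cover the wait with filler only past the threshold."""
    task = asyncio.ensure_future(generate())
    done, _pending = await asyncio.wait({task}, timeout=policy.threshold_s)
    used_filler = False
    if not done and policy.allow_filler():
        await speak(policy.phrase)  # the reply keeps generating under cover
        used_filler = True
    policy.end_turn(used_filler)
    await speak(await task)
    return used_filler
```

A fast turn never reaches the filler branch, and a slow turn that follows a recent filler waits without one, so the phrase stays a cover for real delay rather than a habit.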

**Cached and canned responses.** For the narrow set of exchanges that recur — a greeting, a request to repeat, a common confirmation — a pre-synthesized audio response costs nothing at call time and cannot stall. It will not cover an open-ended turn, but it removes the most frequent turns from the failure surface.
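As a small illustration, the lookup can be as simple as a dictionary consulted before live synthesis. The turn keys, the placeholder byte strings, and the `synthesize` callable are all hypothetical; real entries would hold audio rendered ahead of time.

```python
from typing import Callable, Dict

# Hypothetical cache: turn keys mapped to audio synthesized before the call.
CANNED_AUDIO: Dict[str, bytes] = {
    "greeting": b"<pre-synthesized greeting audio>",
    "repeat_request": b"<pre-synthesized 'could you say that once more?' audio>",
}


def audio_for(turn_key: str, text: str, synthesize: Callable[[str], bytes]) -> bytes:
    """Serve cached audio when the turn is a known one; otherwise call live TTS."""
    cached = CANNED_AUDIO.get(turn_key)
    if cached is not None:
        return cached  # no network call, so nothing here can stall
    return synthesize(text)
```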

**Honest acknowledgment.** When a component is down and the agent genuinely cannot answer, the move is not to hallucinate a response and not to sit silent. It is to say so, plainly: "I'm having trouble pulling that up right now — let me get someone to follow up with the exact answer." Honesty here is both better UX and safer than a fabricated answer from an agent flying blind without its knowledge source — and the caller leaves with a real next step instead of a wrong answer or a dropped line.

**Graceful handoff to a human.** The ultimate fallback, and it has to be engineered, not improvised. When the agent cannot continue, it transfers to a human — and it transfers _well_: it tells the caller what is happening ("I'm going to connect you with someone who can help"), and it hands the human a structured summary — detected intent, entities collected, what has been tried — so the caller does not start over from zero. A handoff that dumps a raw transcript on a human, or worse drops the caller into a cold queue, is a failure wearing a fallback's clothes.
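One way to picture the structured summary is as a small typed record rather than free text. The field names below are assumptions made for illustration; the source only specifies the three kinds of content (detected intent, entities collected, what has been tried).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class HandoffSummary:
    """What the human agent sees at transfer time (hypothetical schema)."""

    intent: str                                          # detected intent
    entities: Dict[str, str] = field(default_factory=dict)  # collected so far
    attempted: List[str] = field(default_factory=list)      # what has been tried
    handoff_reason: str = ""                             # why the agent stopped

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


summary = HandoffSummary(
    intent="reschedule_appointment",
    entities={"date": "next Tuesday"},
    attempted=["calendar lookup failed twice"],
    handoff_reason="scheduling backend unavailable",
)
```

Sending `summary.to_json()` alongside the transfer lets the human pick up from the last completed step instead of asking the caller to repeat themselves.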

Notice each strategy targets a different region of the failure table. Secondary providers cover _down_; filler covers _slow_; canned responses shrink the surface; honesty and handoff cover the case where nothing else worked. A real system needs all of them — no single one covers the whole table.

## The budget angle — failure is a blown budget

There is a unifying way to see "slow." A voice stack runs on a latency budget — [the 280ms breakdown](/blog/280ms-budget/) traces where every millisecond of a sub-300ms target goes, component by component. A component being "slow" is not a separate phenomenon from that budget. It _is_ the budget, exceeded.

Which means the trigger for graceful degradation is not a vague sense that things feel sluggish. It is a number. Each component has a budgeted latency and a hard ceiling above it; when a component crosses its ceiling on a turn, that is the signal — fire the holding phrase, or trip toward the secondary provider. This is the same posture as [agent budgets](/blog/agent-budgets/), which argues that a budget is part of an agent's spec and a budget overrun is a defined event the system must act on, never an open-ended "let it run longer." A voice agent that lets a slow stage run unbounded, hoping it resolves, has no budget — and an 8-second LLM stall is exactly what "no budget" sounds like on a phone line. Per-component timeouts, set against the budget, convert an invisible slow failure into a catchable event with a defined response. "Slow" is only detectable if you decided in advance how slow is too slow.
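A sketch of what "a number, decided in advance" might look like in code. The millisecond values are placeholders for illustration only and are not taken from the linked budget breakdown; `StageLimits`, `Verdict`, and `classify` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    WITHIN_BUDGET = "within_budget"
    OVER_BUDGET = "over_budget"    # worth recording; no caller-facing action
    OVER_CEILING = "over_ceiling"  # defined event: fire filler or fail over


@dataclass(frozen=True)
class StageLimits:
    budget_ms: int
    ceiling_ms: int


# Placeholder numbers; each team sets its own against its latency target.
LIMITS = {
    "stt": StageLimits(budget_ms=90, ceiling_ms=500),
    "llm": StageLimits(budget_ms=130, ceiling_ms=800),
    "tts": StageLimits(budget_ms=60, ceiling_ms=500),
}


def classify(stage: str, elapsed_ms: float) -> Verdict:
    """Map a measured stage latency onto the budget and ceiling for that stage."""
    limits = LIMITS[stage]
    if elapsed_ms > limits.ceiling_ms:
        return Verdict.OVER_CEILING
    if elapsed_ms > limits.budget_ms:
        return Verdict.OVER_BUDGET
    return Verdict.WITHIN_BUDGET
```

With limits like these, `classify("llm", 8000)` is `Verdict.OVER_CEILING`: the 8-second stall stops being an invisible slowdown and becomes an event with a defined response.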

## Designing the fallback ladder

The strategies above are rungs. The design is the ladder — an ordered sequence, per failure, where each rung is tried before the one below it, and the system descends only as far as it must. The principle the ladder enforces is the title of this post: at every rung, **degrade, don't drop**.

Take the LLM stalling past its budget. The ladder, top to bottom:

```
  LLM exceeds latency budget on a turn
        │
        ▼
  [1] holding phrase  ── "let me check on that"
        │                buys wall-clock; caller hears the agent working
        ▼  still no response after extended budget
  [2] failover LLM  ──── secondary provider / model takes the turn
        │                slightly different style; call continues
        ▼  secondary also fails
  [3] honest acknowledgment ── "I'm having trouble with that right now"
        │                       no hallucinated answer, no silence
        ▼  agent genuinely cannot continue
  [4] graceful human handoff ── warm transfer + structured summary
        │
        ▼
  drop  ◄── never reached if the ladder holds
```

Every rung is a real, coherent caller experience: a thoughtful agent, then an agent with a slightly different style, then an honest agent, then a competent handoff. Only past rung 4 is there a dropped call, and a correctly built ladder means a real caller never gets there. Each component failure gets its own ladder: the STT ladder leads with a secondary STT and a polite "could you say that once more," and the TTS ladder leads with a secondary voice. The shared shape is what matters: an ordered descent of survivable states, with "drop" off the bottom of the ladder, not on it.
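The ordered descent can be sketched as a loop over rungs, where each rung is an async callable that reports whether it kept the call alive. This is an illustration under assumptions, not a reference implementation: `descend` and the rung names are hypothetical, and real rungs would wrap the holding phrase, the secondary provider, the acknowledgment, and the transfer.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# A rung is a (name, attempt) pair; attempt() resolves True if it handled the turn.
Rung = Tuple[str, Callable[[], Awaitable[bool]]]


async def descend(rungs: List[Rung]) -> str:
    """Try rungs top to bottom; return the name of the first one that holds."""
    for name, attempt in rungs:
        try:
            if await attempt():
                return name
        except Exception:
            # A rung that raises has failed like any other; keep descending.
            continue
    # Only reachable if every rung failed; tests should treat this as a defect.
    return "drop"
```

Returning the rung name makes the depth of each descent easy to log and alert on, so a rising share of calls ending at `human_handoff` shows up before callers complain.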

## Fault-injection testing for voice

A fallback ladder you have never exercised is a hypothesis, not a feature. The failure modes above will occur in production whether or not you tested them; the only question is whether the first time the ladder runs is in a test or on a real caller. So inject the faults deliberately.

Voice fault injection is chaos engineering pointed at the voice stack — disable individual components on purpose and verify the failover fires. On a regular cadence:

- **Kill each component in turn.** Take STT offline; confirm the secondary picks up and the caller hears at most a brief, covered hitch. Same for TTS, same for the LLM provider.
- **Inject _slow_, not just down.** This is the test teams skip, and the one that bites. Add latency (make the LLM respond in 9 seconds, make TTS first byte arrive in 4) and confirm the latency ceiling actually trips the holding phrase. A ladder tested only against hard failures has never exercised rung 1.
- **Degrade the network.** Introduce jitter and packet loss on the media path and confirm audio stays intelligible and the call holds.
- **Exercise the human handoff end to end.** The escalation path is code, and untested code; run a real call all the way to a warm transfer and confirm the summary arrives with the human.

The pass condition is strict: a test passes only when the caller's experience stayed inside the ladder — degraded, coherent, never dead air past the budget, never a silent drop. "The error was logged" is not a pass. The caller cannot read your logs; they can only hear whether the agent kept talking.
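A test harness for this needs two pieces: a way to inject the fault, and a way to judge the result by what the caller heard. The sketch below assumes stages are async callables and that the harness records when the agent was audible; `with_fault` and `longest_silence` are hypothetical helpers, not part of any chaos tooling.

```python
import asyncio
from typing import List, Optional, Tuple


def with_fault(stage_fn, delay_s: float = 0.0, error: Optional[Exception] = None):
    """Wrap a stage so it responds slowly, fails outright, or both."""

    async def wrapped(*args, **kwargs):
        if delay_s:
            await asyncio.sleep(delay_s)  # the "slow" fault
        if error is not None:
            raise error                   # the "down" fault
        return await stage_fn(*args, **kwargs)

    return wrapped


def longest_silence(audio_spans: List[Tuple[float, float]],
                    window_start: float, window_end: float) -> float:
    """Longest gap, in seconds, with no agent audio inside the agent's turn."""
    longest = 0.0
    cursor = window_start
    for start, end in sorted(audio_spans):
        longest = max(longest, start - cursor)
        cursor = max(cursor, end)
    return max(longest, window_end - cursor)
```

A degradation test then asserts on `longest_silence(...)` against the dead-air budget, rather than on whether an error reached the logs.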

## The checklist

Before a voice agent takes real calls:

- [ ] Every component — STT, LLM, TTS, transport — has its _down_ and its _slow_ failure mode enumerated, with a defined response to each.
- [ ] STT and TTS have a pre-warmed secondary provider; failover is circuit-breaker-driven and does not cold-start into the dead-air budget.
- [ ] Holding/filler speech is wired, triggered by a latency threshold — not every turn — and rate-limited so the agent never talks over the caller.
- [ ] Each component has a budgeted latency and a hard ceiling; crossing the ceiling is a detected event, not an invisible slowdown.
- [ ] When a component is down, the agent acknowledges honestly and never hallucinates an answer in place of its missing knowledge source.
- [ ] A graceful human handoff exists: the caller is told what is happening, and the human receives a structured summary, not a raw transcript or a cold queue.
- [ ] Every failure has an ordered fallback ladder ending in handoff; "drop the call" is off the bottom of the ladder, not a rung on it.
- [ ] Fault injection runs on a cadence — components killed _and_ slowed, network degraded, handoff exercised end to end.
- [ ] A degradation test passes only when the caller's experience stayed coherent; a logged error with dead air on the line is a failure.

## Reading list

- CallSphere's [voice agent failover and reliability patterns](https://callsphere.ai/blog/ai-voice-agent-failover-reliability-patterns) — circuit breakers, retry budgets inside a call, and a monthly chaos cadence for a voice stack.
- LiveKit's [handoff pattern for voice agents](https://livekit.com/blog/handoff-pattern-voice-agents) — what a graceful, summary-carrying transfer to a human actually looks like.
- InfoQ's report on [Google Cloud's chaos engineering framework](https://www.infoq.com/news/2025/11/google-chaos-engineering/) — the general discipline of injecting failure deliberately, which voice fault injection is a specific application of.

A web app gets to fail with a spinner and a retry. A voice agent fails with a human on the line and no spinner to show them. Build the ladder, then go push it off every rung yourself — before a caller does.