A team sits down to build a voice agent in 2026 and hits its first real decision before a line of code is written. Not the model, not the provider, not the prompt — the architecture. There are two of them, and they are genuinely different machines. One is the pipeline everyone already knows: speech-to-text, then a language model, then text-to-speech, three stages wired in series. The other is a single model that takes audio in and emits audio out, with no text in the middle — a speech-to-speech model, end-to-end.
The pitch for end-to-end is easy to feel the moment you hear one: it is fast, and it sounds alive. It hears your tone and answers in kind; it handles an interruption the way a person does. The cascade, next to it, can sound a half-step slow and a shade flat. So the instinct is to reach for end-to-end and call the cascade legacy.
That instinct is wrong, or at least premature. The two architectures are not “old” and “new.” They are a trade — and the thing you trade away when you pick end-to-end is most of what makes a voice agent operable: the transcript you audit, the component you swap, the text checkpoint where every guardrail, eval, and redaction rule you already own gets to run.
This post is the trade study. We separate what each architecture genuinely wins, put real 2026 model names against the claims, and end with a decision framework — because the right answer is not universal, and a team that picks by vibe ships the wrong machine.
The two architectures, defined
The cascade is three models in series. Speech-to-text (STT, also called ASR) transcribes the caller’s audio into text. A language model reads that text, possibly calls tools, and produces a text response. Text-to-speech (TTS) synthesizes that response back into audio. Every stage is a separate model, often from a separate vendor, connected by text. The defining property is that text exists, in full, at two points in the pipeline — the transcript of what the caller said, and the transcript of what the agent will say.
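A minimal sketch of that shape, with the three stages as stub functions. Every name here (`run_cascade_turn`, `TurnRecord`, the `fake_*` stages) is a hypothetical placeholder rather than any vendor's API. The only point is that both transcripts exist as plain strings you can log, filter, or audit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnRecord:
    """The two text checkpoints a cascade produces for one turn."""
    caller_text: str  # what the caller said (STT output)
    agent_text: str   # what the agent will say (LLM output)


def run_cascade_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> tuple[bytes, TurnRecord]:
    """Run STT, then the LLM, then TTS, keeping both transcripts."""
    caller_text = stt(audio_in)    # first text checkpoint
    agent_text = llm(caller_text)  # second text checkpoint
    audio_out = tts(agent_text)
    return audio_out, TurnRecord(caller_text, agent_text)


# Stub stages so the sketch runs without real models (assumptions only).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


audio_out, record = run_cascade_turn(b"book a table", fake_stt, fake_llm, fake_tts)
```

Because each stage is passed in as a callable, swapping one vendor for another changes a single argument and leaves the other two stages alone.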
The end-to-end model — speech-to-speech, S2S — collapses all three into one. Audio goes in; audio comes out; there is no text intermediate the system is built around. Internally these models still operate on discrete tokens — Kyutai’s Moshi, an open speech-text foundation model, generates speech as tokens from the residual quantizer of a neural audio codec and predicts time-aligned text as an “inner monologue” prefix — but that internal text is a training and quality device, not an inspection surface the way a cascade’s transcript is. The defining property is the absence of a stage boundary you can reach into.
In 2026 the end-to-end option is real and shipping. OpenAI’s Realtime API is generally available, and the May 2026 release added gpt-realtime-2, its first voice model built on GPT-5-class reasoning. Google shipped Gemini 3.1 Flash Live on 26 March 2026, an audio-to-audio model with support for over 90 languages. Both are speech-to-speech: audio in, audio out, prosody preserved. The cascade, meanwhile, is not going anywhere — by most accounts it still runs the large majority of production voice agents. Two live options, genuinely different. Hence the trade study.
What end-to-end wins — the collapsed pipeline
Two things, and they are the two things a caller actually feels.
Latency, by removing the hops. A cascade pays for its modularity in serial time. Audio has to be transcribed before the LLM can read it; the LLM has to finish enough text before TTS can speak it; and every one of those handoffs is a network round trip if the stages live on different nodes or in different regions. The latency-budget math for a voice stack — which we took apart millisecond by millisecond in the 280ms budget — is largely the math of staging these hops and overlapping them. End-to-end deletes the hops. There is one model, one inference, one round trip. Reported end-to-end response times cluster well below what a naive cascade produces; a well-built cascade can still hit a sub-300ms budget through aggressive overlap, but it is hitting that budget by engineering against the staging, while end-to-end never incurs the staging in the first place.
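The staging arithmetic can be sketched in a few lines. The stage timings, hop cost, and amount of work hidden by streaming below are made-up numbers chosen only to show the shape of the calculation; they are not measurements of any real system.

```python
def serial_latency_ms(stage_ms: list[int], hop_ms: int) -> int:
    """Each stage waits for the previous one; each handoff costs one hop."""
    return sum(stage_ms) + hop_ms * (len(stage_ms) - 1)


def overlapped_latency_ms(stage_ms: list[int], hop_ms: int, hidden_ms: int) -> int:
    """Same pipeline, minus the work that streaming hides behind other work
    (for example, STT running while the caller is still talking)."""
    return max(0, serial_latency_ms(stage_ms, hop_ms) - hidden_ms)


# Illustrative assumptions, in milliseconds.
CASCADE_STAGES_MS = [200, 400, 150]  # STT, LLM, TTS
HOP_MS = 40                          # one inter-stage network round trip
HIDDEN_BY_STREAMING_MS = 500         # work overlapped through streaming
SINGLE_MODEL_MS = [250]              # one model, so no inter-stage hops

naive = serial_latency_ms(CASCADE_STAGES_MS, HOP_MS)
overlapped = overlapped_latency_ms(CASCADE_STAGES_MS, HOP_MS, HIDDEN_BY_STREAMING_MS)
single = serial_latency_ms(SINGLE_MODEL_MS, HOP_MS)

print(naive, overlapped, single)
```

With these assumed inputs the cascade only gets close to the single model after the overlap term is subtracted, which is the engineering effort the paragraph above describes.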
Paralinguistic naturalness, by never going through text. This is the subtler win and the more durable one. When a cascade converts speech to text, it throws away everything that was not words — tone, emphasis, hesitation, the rising pitch of a question, the sigh before a complaint. The LLM reasons over a flat transcript; the TTS re-synthesizes prosody from scratch, guessing. An end-to-end model never makes that round trip through text, so it can hear that a caller is frustrated and answer in a calmer register, carry emphasis through, and handle overlapping speech and interruptions natively because it was trained on audio where those things happen. Moshi’s design makes the point structurally: it models its own speech and the user’s as two parallel streams, which removes the notion of a rigid speaker turn altogether. If the paralinguistic channel is the product — a coaching agent, a companion, an emotionally-aware support line — end-to-end is not a little better here. It is a different category.
Those are real wins. Neither is free.
What the cascade wins — everything with a text checkpoint
Line them up, because this is the longer column and the one teams discount.
An inspectable transcript. The cascade produces text at both ends, and that text is the single most useful artifact in an operable voice system. When the agent says something wrong, the first question is always what did it actually say — and in a cascade you have the exact string, at every turn, for every call. With end-to-end there is no text intermediate; debugging “why did it say that” means reasoning about an audio model’s behavior with no checkpoint to read. For a regulated workflow — healthcare, finance, anything with an audit obligation — an after-the-fact transcript is not a nice-to-have. It is the compliance artifact, and the cascade emits it as a byproduct.
Component-level control. Three models from three vendors means three independent swap points. The cheapest, fastest, or best-quality STT today is not the one you will want in a year; in a cascade you change it without touching the other two stages. End-to-end is, in the words of one comparison, all-or-nothing: if the model has a bad day, the whole agent has a bad day, and there is no swapping the TTS half because there is no TTS half. The market shape reinforces this — a cascade composes from a deep menu of STT, TTS, and LLM vendors, while end-to-end speech-to-speech is, in 2026, a short list of frontier labs.
Text-domain tooling — your whole existing stack. This is the one that decides most enterprise builds. Everything the industry has built for text LLMs assumes a text checkpoint: prompt-injection guardrails, output filters, evaluation suites, function-calling schemas, PII redaction. A cascade has exactly that checkpoint — the LLM’s text input and output — so all of it runs unmodified. You can screen the response for policy before a single phoneme is synthesized. End-to-end has no such seam: there is no text moment at which to run a content filter before the caller hears the words. The agent-security discipline we lay out elsewhere — treating model output as untrusted, putting deterministic checks between proposal and action — needs a text surface to act on, and the cascade is the architecture that has one.
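As a rough illustration of what that seam allows, the sketch below screens the LLM's text before it would be handed to TTS. The function name, the blocked phrase, and the card-number pattern are all hypothetical and far from a complete policy or PII rule set.

```python
import re

# Illustrative pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_NUMBER = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

# Hypothetical policy list, for the sake of the example.
BLOCKED_PHRASES = ("guaranteed returns",)

FALLBACK = "I can't help with that directly, but I can connect you with a specialist."


def screen_response(agent_text: str) -> tuple[bool, str]:
    """Return (allowed, text_to_speak) for one LLM response.

    Runs on text, before any audio is synthesized.
    """
    lowered = agent_text.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        # Replace the whole response rather than speak a policy violation.
        return False, FALLBACK
    # Redact anything that looks like a card number, then allow.
    return True, CARD_NUMBER.sub("[redacted]", agent_text)


allowed, text = screen_response("Your card 4111 1111 1111 1111 is on file.")
blocked_ok, blocked_text = screen_response("This fund has guaranteed returns.")
```

In a cascade, a check like this sits between the LLM call and the TTS call. A pure speech-to-speech model has no equivalent point to attach it to before the caller hears the output.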
Cost and language coverage. A cascade lets you put a small, cheap STT and a small, cheap TTS around a right-sized LLM, and pay frontier prices only where reasoning demands it. End-to-end bills every second of audio through a single frontier model. Language coverage tilts the same way: a cascade inherits the language support of whichever STT and TTS you pick, and those have been broadened for years. End-to-end coverage is improving fast — Gemini 3.1 Flash Live advertises 90-plus languages — but you are buying one model’s coverage, not composing the best per language.
The pattern under the whole column: the cascade’s modularity is its operability. Every seam is a place to observe, swap, or enforce. End-to-end’s seamlessness is exactly what removes those places.
Side by side
| Dimension | Cascade (STT→LLM→TTS) | End-to-end (speech-to-speech) |
|---|---|---|
| Latency | Good with overlap engineering; pays for staged hops | Lower; no inter-stage hops |
| Paralinguistic nuance | Flattened — prosody lost at the text round trip | Native — tone, emphasis, interruptions preserved |
| Transcript / audit trail | Exact text at both ends, as a byproduct | None — no text intermediate to inspect |
| Debuggability | Fault isolates to a named stage | Opaque — reason about one audio model |
| Component swap | Three independent swap points | All-or-nothing; one model |
| Text-domain tooling | Guardrails, evals, redaction run unmodified | No text seam to attach them to |
| Function calling | Mature — runs in the text LLM | Improving; benchmarked but newer |
| Cost | Right-size each stage; frontier price only on the LLM | Every audio-second billed through one frontier model |
| Language coverage | Composed — best STT/TTS per language | One model’s coverage (broad and widening) |
| Vendor choice | Deep menu at every stage | Short list of frontier labs |
Read the table as a shape, not a scorecard. End-to-end wins the top two rows, the two a caller feels in the first second. The cascade wins the remaining eight, the ones an operator lives with for the product’s whole life. That is the trade, stated plainly.
A decision framework
Do not pick by which demo felt better. Pick by walking four questions, in order. The first hard “yes” or hard “no” usually settles it.
1. Do you need a transcript — for compliance, audit, or QA? If a regulator, a contract, or your own quality process requires a record of what was said, you need the text checkpoint. That is the cascade. End-to-end can be transcribed after the fact with a separate STT pass, but now you are running a cascade’s worth of STT anyway and trusting a reconstruction. If the answer is yes, stop here: build the cascade.
2. Do you need component-level control or text-domain enforcement? If the agent calls consequential tools, moves money, touches PII, or must run injection guardrails and output filters before it speaks — all of that wants the LLM’s text seam. If you need to swap a vendor without re-architecting, same answer. A yes here points to the cascade.
3. Is paralinguistic nuance the actual product? Be honest about this one. For a transactional agent — booking, lookup, routing, tier-1 support — prosody is polish, not the product, and the cascade’s flatter delivery is an acceptable cost for everything in column two. But for an agent whose value is emotional attunement — coaching, companionship, sensitive-topic support, de-escalation — the round trip through text discards the product. A genuine yes here is the strongest case for end-to-end.
4. What is your language and cost profile? Many languages with uneven quality needs, or tight per-minute economics, favor a composed cascade. A few major languages and a willingness to pay frontier rates for naturalness widen the door to end-to-end.
The framework is deliberately ordered. Compliance and control are near-binary constraints — if you have them, they decide, and the naturalness question never gets to vote. Only when questions 1 and 2 both come back “no” does question 3 become the real fork. Most enterprise voice work answers “yes” to 1 or 2; most companion-style products answer a clean “no, no, yes.” The architecture is usually implied by the use case — the mistake is choosing before you have asked.
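One way to read the ordering is as a short-circuiting function. This is a sketch of the four questions as stated above, with hypothetical field names, and it is not a substitute for actually answering them.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    needs_transcript: bool              # question 1
    needs_text_enforcement: bool        # question 2 (control, guardrails, vendor swap)
    nuance_is_product: bool             # question 3
    many_languages_or_tight_cost: bool  # question 4


def recommend(uc: UseCase) -> str:
    """Walk the questions in order; the first decisive answer wins."""
    if uc.needs_transcript:
        return "cascade"
    if uc.needs_text_enforcement:
        return "cascade"
    if uc.nuance_is_product:
        return "end-to-end"
    # Only the cost and language profile is left to break the tie.
    return "cascade" if uc.many_languages_or_tight_cost else "either"


enterprise = recommend(UseCase(True, False, True, False))
companion = recommend(UseCase(False, False, True, False))
multilingual = recommend(UseCase(False, False, False, True))
```

Note that the enterprise case returns the cascade even though nuance matters to it: the compliance answer is reached first, so the naturalness question never gets evaluated.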
Where the field is heading — hybrids
The honest 2026 read is that this is not a permanent either/or. Two things are converging on a middle.
The end-to-end models are growing the capabilities that used to be cascade-only. gpt-realtime-2 is explicitly built for tool calls and corrections mid-conversation; Gemini 3.1 Flash Live posts strong numbers on an audio function-calling benchmark. The text-shaped competencies — function calling, instruction-following — are arriving in the audio-native models. They are not yet at parity with a mature text LLM on complex multi-step tool use, but the gap is closing release over release.
Meanwhile the practical production answer is increasingly hybrid: end-to-end for the parts of a call where latency and warmth dominate — the opening, the small talk, the emotional turns — and a cascade path for the turns that need a tool call, a RAG lookup, or a compliance gate. Route within the call. Use the architecture that fits the turn. This is harder to build than either pure design, and it is where serious voice teams are spending their 2026 — which is its own reason to understand both architectures cold rather than betting the product on one.
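A per-turn router might look something like the sketch below. The three flags are assumed to come from some upstream intent or policy classifier, which is the genuinely hard part and is not shown. All names are hypothetical.

```python
from enum import Enum


class Path(Enum):
    SPEECH_TO_SPEECH = "speech-to-speech"
    CASCADE = "cascade"


def route_turn(
    needs_tool_call: bool,
    needs_retrieval: bool,
    needs_compliance_gate: bool,
) -> Path:
    """Send a turn down the cascade whenever it needs a text checkpoint."""
    if needs_tool_call or needs_retrieval or needs_compliance_gate:
        return Path.CASCADE
    # Openings, small talk, emotional turns: latency and warmth dominate.
    return Path.SPEECH_TO_SPEECH


greeting = route_turn(False, False, False)
refund = route_turn(True, False, True)
```

The default is the speech-to-speech path, and any single requirement for a text seam is enough to switch the turn to the cascade.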
Two scope notes, because this post is only the architecture decision. How the agent decides the caller has finished speaking, handles barge-in, and places its backchannels is a problem both architectures share and neither solves for free — that is turn-taking. And what either architecture does when a stage fails or stalls mid-call — secondary providers, holding speech, graceful handoff — is graceful failure for voice agents. Pick the architecture here; build those there.
Synthesis
End-to-end speech-to-speech is faster and more natural because it deletes the text round trip. The cascade is more observable, more controllable, cheaper to right-size, and compatible with every text-domain guardrail and eval you already run, because it keeps the text round trip. There is no architecture that is both. A 2026 voice agent picks the cascade when it needs a transcript, component control, or a text checkpoint for enforcement (which is most enterprise work) and reaches for end-to-end when paralinguistic nuance is the product itself. The growing middle is hybrid, routing per turn. Choosing either architecture is not the error. The error is picking before you have asked which machine the use case actually needs.
Pick the cascade and you are choosing a machine you can take apart. Pick end-to-end and you are choosing one you cannot — for the calls where that is the right trade.