Turn-taking is the hard part of voice agents.

Here is a voice agent that transcribes flawlessly. Every word the caller says lands in the transcript correctly, accents and domain jargon and all; the speech-to-text model is, by any benchmark you point at it, excellent. And the agent feels broken. It cuts the caller off mid-sentence. It sits in dead air for a beat too long after they finish. When the caller murmurs “mm-hm” to show they are following, the agent stops, treats it as a turn, and starts answering a question nobody asked.

None of that is a transcription failure. The words were all correct. What failed is everything around the words — the timing. The agent does not know when the caller is done, when it is being interrupted, or when a sound from the caller is a turn versus a nod. That cluster of problems is turn-taking, and it is the part of voice agents that is genuinely not solved.

It is worth being precise about why. Speech-to-text has had a long, well-funded decade; in 2026, on clean audio, transcription accuracy is good enough that it is rarely the thing breaking a voice product. Turn-taking has had a fraction of that attention and is structurally harder — it is a real-time prediction problem about a human’s intent, made on incomplete evidence, where both kinds of error are immediately audible. This post is about that problem: endpointing, barge-in, and backchanneling — what each one is, why each is hard, and how to measure whether you have gotten it right.

Two scope notes up front. Whether you build a cascade or an end-to-end speech-to-speech model (the subject of the architecture trade study), you have a turn-taking problem; neither architecture gets it for free. And what the agent does when a component fails mid-turn is graceful failure, a separate discipline. This post covers the conversational dynamics.

The conversation has a clock

Start with the number turn-taking is measured against. In ordinary human conversation, the gap between one person finishing and the next person starting is short — corpus studies of conversational speech put the median transition gap under about 300ms, and most turns change hands with only 0–200ms of silence, sometimes with a slight overlap. People do not wait for proof that the other person is done. They predict it.

That is the bar, and it exposes the structural difficulty. Producing a spoken response is itself slow — language production research measures utterance-planning latencies well over half a second — yet the observed gap is a fraction of that. The only way the arithmetic works is that humans plan their reply while still listening, predicting where the other person’s turn will end and having a response ready to launch at that boundary. A voice agent that waits for definitive silence before it even begins is already losing a race that humans win by anticipation.

Turn-taking, then, is not one feature. It is three timing decisions the agent makes on every exchange:

  caller speaking ──────────────┐
                                │  ENDPOINTING
                                ▼  "is the caller done?"
                       ┌────────────────┐
                       │  agent decides │
                       └────────────────┘
                                │
            ┌───────────────────┼───────────────────┐
            ▼                   ▼                   ▼
       too early            on time             too late
   cut the caller off    natural gap          dead air
            │                                       │
            └──► BARGE-IN ◄──┐         BACKCHANNEL ──┘
              caller talks   │       caller's "mm-hm" is
              over the agent │       not a turn — and the
              — stop, fast   │       agent should make its own

Each of the three has its own failure mode and its own measurement. Take them in order.

Endpointing — the silence-threshold trap

Endpointing, or end-of-turn detection, is the agent deciding the caller has finished and it is now the agent’s move. Get it wrong in one direction and the agent interrupts; wrong in the other and it leaves dead air. There is no third option, and the two errors pull against each other.

The traditional mechanism is voice-activity detection plus a silence timer. A VAD model — Silero VAD is the open, lightweight default, small enough to process a 30ms audio chunk in roughly a millisecond on a single CPU thread — classifies each frame as speech or not-speech. When the VAD has seen some configured duration of continuous silence, the agent declares the turn over.

That configured duration is the whole problem, because it is one knob and it is being asked to do two contradictory jobs:

Set it short — say 200ms — and the agent is responsive, but it fires on every natural mid-sentence pause. People pause to breathe, to think, to say “ummm.” A 200ms threshold treats “I’d like to check on my…” — said, then a breath — as a completed turn and barges into the caller’s own sentence.
Set it long — say 800ms — and the agent stops interrupting, but now every single turn carries up to 800ms of dead air before the agent responds. The conversation feels laggy and slow, well outside the human gap.
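The two settings above can be compared on the same utterance. The following is a minimal sketch, assuming a VAD that emits one speech/not-speech flag per 30ms frame; the function name, the frame size constant, and the sample utterance are illustrative, not any library's API.

```python
from typing import Iterable, Optional

FRAME_MS = 30  # assumed frame length, matching the VAD chunk size above


def endpoint_time_ms(frames: Iterable[bool], silence_ms: int) -> Optional[int]:
    """Return the time (ms) at which the turn is declared over, or None.

    `frames` is a per-frame speech/not-speech sequence, as a VAD would emit.
    The turn ends once `silence_ms` of continuous silence follows some speech.
    """
    heard_speech = False
    silent_run_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_run_ms += FRAME_MS
            if silent_run_ms >= silence_ms:
                return (i + 1) * FRAME_MS
    return None


# Hypothetical utterance: 900 ms of speech, a 390 ms breath,
# 900 ms more speech (ending at 2190 ms), then trailing silence.
utterance = [True] * 30 + [False] * 13 + [True] * 30 + [False] * 40

early = endpoint_time_ms(utterance, 200)  # fires inside the breath
late = endpoint_time_ms(utterance, 800)   # waits out the breath, then adds lag
```

With the 200ms setting the endpointer fires at 1110ms, in the middle of the breath. With 800ms it survives the breath but fires at 3000ms, about 810ms after the caller actually finished at 2190ms.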

You cannot win this with the silence knob alone, because silence duration does not actually carry the information you need. “I’d like to check on my…” and “I’d like to check on my account balance.” can be followed by the identical length of silence; what differs is not the acoustics of the pause but the meaning of the words before it. One is a grammatically incomplete fragment; the other is a complete request.

That is the case for semantic turn detection — a model that looks at the content of what was said, not just the silence after it. Two production approaches in 2026:

A semantic turn-detector model alongside the VAD. LiveKit’s open-weights turn-detector is the clearest example: a small transformer — its multilingual line distilled from a larger teacher into roughly a half-billion-parameter student — that reads the conversation text and predicts whether the user is actually finished. LiveKit reports that an improved version cut false-positive interruptions by around 39% relative to an earlier one, with no added response latency. Pipecat’s open-source Smart Turn takes the audio-native route — its v3, released in September 2025, runs semantic VAD directly on the waveform with about 12ms of CPU inference and covers 23 languages. The VAD still detects speech presence; the semantic model decides whether the silence means “done.”
A speech-to-text model that does turn detection itself. Deepgram’s Flux, released October 2025, folds transcription and turn detection into one model — the same model that produces the transcript models conversational flow, so it can tell that a turn ending in “because…” is not complete while one ending in “thanks so much.” is. It exposes an eager end-of-turn signal for teams that want to start generating a response speculatively and cancel it if the caller resumes.
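One way to picture how a semantic signal changes the decision is to let it choose how much silence the agent requires. This is a sketch under stated assumptions: `completion_score` is a toy word-list heuristic standing in for a trained turn-detection model, and the thresholds are hypothetical. It is not how LiveKit, Pipecat, or Deepgram implement their models.

```python
# Words that rarely end a complete request (illustrative list only).
INCOMPLETE_ENDINGS = {"my", "the", "a", "an", "to", "and", "but", "because", "or", "of"}


def completion_score(transcript: str) -> float:
    """Toy stand-in for a turn-detection model; returns P(turn is complete)."""
    text = transcript.strip()
    words = text.lower().rstrip(".?!… ").split()
    if not words:
        return 0.0
    if text.endswith(("...", "…")) or words[-1] in INCOMPLETE_ENDINGS:
        return 0.1
    return 0.9


def silence_needed_ms(transcript: str, short_ms: int = 200,
                      long_ms: int = 1200, threshold: float = 0.5) -> int:
    """Require little silence after a complete-looking turn, more otherwise."""
    return short_ms if completion_score(transcript) >= threshold else long_ms


def should_end_turn(transcript: str, silence_ms: int) -> bool:
    """Combine the VAD's measured silence with the semantic estimate."""
    return silence_ms >= silence_needed_ms(transcript)
```

The same 400ms pause now produces different decisions: `should_end_turn("I'd like to check on my", 400)` is false, while `should_end_turn("I'd like to check on my account balance.", 400)` is true. The silence timer is still there as a backstop, but it no longer carries the whole decision.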

The takeaway is not a specific vendor. It is that endpointing on a raw silence threshold has a ceiling, and crossing it means giving the agent a signal about meaning, not just about quiet.

Barge-in — let the caller interrupt, and stop fast

Barge-in is the caller talking while the agent is talking. In a real conversation this is constant and normal — the caller corrects a misunderstanding, answers before the agent finishes the question, or simply changes their mind mid-sentence. An agent that cannot be interrupted — that finishes its scripted sentence while the caller is plainly trying to redirect it — feels less like a conversation and more like a recording, and callers hate it.

Barge-in has two halves, and both have to be fast.

Detect it. While the agent’s TTS is playing, the agent has to keep listening, and detect that the caller has started speaking over it. This is harder than ordinary endpointing because the agent’s own audio is on the line — naive detection will hear the agent’s own voice, or its echo, as a caller barge-in. Acoustic echo cancellation is the floor; above it, the agent has to distinguish a genuine interruption from noise.

Stop fast. Once a barge-in is detected, the agent must stop its own TTS playback and start listening, immediately. Latency here is brutally visible: every extra 100ms the agent keeps talking after the caller has started is 100ms of two people talking over each other, and it reads as the agent not listening. Stopping involves halting TTS synthesis, flushing whatever audio is already buffered downstream toward the caller, and resetting the agent’s state to listening.
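The three stop steps can be sketched as a small state holder. The class and field names here are hypothetical, and the deque stands in for whatever audio buffer sits between synthesis and the caller; a real pipeline would also need to flush audio already handed to the transport or the client device.

```python
from collections import deque
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


class Playback:
    """Illustrative agent-side playback state, not a real framework class."""

    def __init__(self) -> None:
        self.state = State.LISTENING
        self.synthesis_active = False
        self.buffer: deque = deque()  # chunks synthesized but not yet played

    def start_speaking(self, chunks) -> None:
        self.state = State.SPEAKING
        self.synthesis_active = True
        self.buffer.extend(chunks)

    def on_barge_in(self) -> int:
        """Stop immediately; return how many buffered chunks were discarded."""
        self.synthesis_active = False  # 1. halt TTS synthesis
        dropped = len(self.buffer)
        self.buffer.clear()            # 2. flush audio that has not played yet
        self.state = State.LISTENING   # 3. reset to listening
        return dropped


playback = Playback()
playback.start_speaking([b"chunk-1", b"chunk-2", b"chunk-3"])
discarded = playback.on_barge_in()
```

Keeping the steps in one method makes the stop atomic from the caller's point of view, and the discarded count is a convenient thing to log next to barge-in stop latency.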

The hard part is not the stopping — it is deciding whether to stop, because not every sound from the caller is an interruption. Which is the next problem.

Backchanneling — the “mm-hm” that is not a turn

Backchannels are the short sounds a listener makes to signal “I’m still here, keep going” — “mm-hm,” “uh-huh,” “right,” “okay,” “yeah.” They are not turns. They are not interruptions. The speaker is meant to keep talking straight through them, and in human conversation they do.

A voice agent gets backchanneling wrong in both directions.

Treating the caller’s backchannel as a turn. The agent is mid-explanation and pauses for half a second. The caller, being a normal human, says “mm-hm” into that pause to signal go on. A naive agent’s VAD registers speech, its endpointer registers a completed short turn, and the agent stops and starts responding to a non-turn. The agent’s own sentence is now abandoned. This is a barge-in detector and an endpointer both failing to recognize that “mm-hm” is acoustically speech but conversationally a continuation signal. The fix lives in the same semantic-turn-detection layer as endpointing: a model that reads content can learn that “mm-hm” alone is not a turn-yielding utterance. An audio-native turn model can learn the same from the waveform.
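As a rough illustration of the content-based check, the sketch below uses a fixed word list in place of a learned model. The list, the two-token limit, and both function names are assumptions made for the example; a trained turn detector would make this call from context rather than a lexicon.

```python
# Illustrative lexicon only; a learned model would replace this.
BACKCHANNELS = {"mm-hm", "mhm", "uh-huh", "right", "okay", "ok", "yeah", "yep"}


def is_backchannel(utterance: str) -> bool:
    """True for one or two tokens drawn entirely from the backchannel list."""
    tokens = [t.strip(".,!?").lower() for t in utterance.split()]
    tokens = [t for t in tokens if t]
    return 0 < len(tokens) <= 2 and all(t in BACKCHANNELS for t in tokens)


def should_yield(utterance: str, agent_is_speaking: bool) -> bool:
    """Decide whether caller speech should take the floor from the agent."""
    if not utterance.strip():
        return False
    if agent_is_speaking and is_backchannel(utterance):
        return False  # a continuation signal: keep talking
    return True
```

Note the dependence on `agent_is_speaking`: “okay” said over the agent's explanation is a nod, but “okay” said after the agent asks a question is an answer, so the same word yields the floor in one case and not the other.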

Never producing its own backchannels. The other half, and the one teams forget. When the caller is giving a long answer — reciting an account number, describing a problem — and the agent is utterly silent for fifteen seconds, the caller starts to wonder if the line dropped. A human listener would be dropping “mm-hm”s in. An agent that produces well-placed, sparse backchannels feels present. The danger is overdoing it — backchannels jammed in too often, or placed where they collide with the caller’s words, are worse than silence. This is a small feature with a narrow good zone: a few, well-timed, never on top of the caller.
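The narrow good zone can be expressed as a gate with three conditions. This is a sketch only: every threshold below is a hypothetical starting value rather than a measured one, and the function name is made up for the example.

```python
def should_backchannel(caller_speaking_for_ms: int,
                       pause_ms: int,
                       since_last_backchannel_ms: int,
                       min_turn_ms: int = 5000,
                       min_pause_ms: int = 250,
                       max_pause_ms: int = 600,
                       min_gap_ms: int = 6000) -> bool:
    """Gate an agent-produced "mm-hm" on a long turn, a brief pause, and sparsity."""
    long_turn = caller_speaking_for_ms >= min_turn_ms
    # Only speak into a short pause: never on top of the caller (pause_ms == 0),
    # and not into a long silence, which is the endpointer's decision instead.
    in_brief_pause = min_pause_ms <= pause_ms <= max_pause_ms
    sparse_enough = since_last_backchannel_ms >= min_gap_ms
    return long_turn and in_brief_pause and sparse_enough
```

For example, `should_backchannel(8000, 300, 10000)` is true, while the same call with `pause_ms=0` is false because the caller is still talking.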

Measuring turn-taking

Turn-taking cannot be tuned by ear until it “feels good.” It has to be instrumented, because the failures are individually brief and only a distribution reveals them. Record four metrics on every turn, the way the 280ms budget instruments per-component latency:

Response latency: the time from the caller’s true end-of-turn to the agent’s first audio. This is the gap that human conversation keeps near 200–300ms. Budget it per-percentile; the tail is what callers feel.
False-interruption rate: the fraction of turns where the agent started talking while the caller had not actually finished. This is the endpointer firing too eagerly, or a backchannel misread as a turn. It measures the single most damaging turn-taking failure.
Missed-turn / endpointing-lag rate: the fraction of turns where the agent left an unnaturally long silence after the caller genuinely finished. This is the endpointer being too conservative, the opposite error, and you cannot tune one rate without watching the other.
Barge-in stop latency — from caller-starts-talking-over-the-agent to agent-audio-actually-stops. Every 100ms here is audible over-talk.
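A minimal sketch of computing the four metrics from per-turn records follows. It assumes each turn has a labeled true end-of-turn timestamp, which in practice comes from human annotation or an offline reference; the record fields, the 1000ms lag limit, and the nearest-rank percentile are illustrative choices.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    caller_end_ms: int                        # labeled true end of the caller's turn
    agent_start_ms: int                       # first agent audio
    barge_in_start_ms: Optional[int] = None   # caller starts talking over the agent
    agent_stop_ms: Optional[int] = None       # agent audio actually stops


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(turns: List[Turn], lag_limit_ms: int = 1000) -> dict:
    gaps = [t.agent_start_ms - t.caller_end_ms for t in turns]
    on_time = [g for g in gaps if g >= 0]  # a negative gap means the agent cut in
    stops = [t.agent_stop_ms - t.barge_in_start_ms for t in turns
             if t.barge_in_start_ms is not None and t.agent_stop_ms is not None]
    return {
        "false_interruption_rate": sum(g < 0 for g in gaps) / len(gaps),
        "missed_turn_rate": sum(g > lag_limit_ms for g in gaps) / len(gaps),
        "response_latency_p50_ms": percentile(on_time, 50) if on_time else None,
        "response_latency_p95_ms": percentile(on_time, 95) if on_time else None,
        "barge_in_stop_p95_ms": percentile(stops, 95) if stops else None,
    }


# Hypothetical log of four turns.
report = summarize([
    Turn(1000, 1250),
    Turn(5000, 4800),                 # agent started 200 ms before the caller finished
    Turn(9000, 10500),                # 1500 ms of dead air
    Turn(14000, 14300, 15000, 15180), # barge-in stopped after 180 ms
])
```

On this sample the false-interruption and missed-turn rates are each 0.25, and the p95 response latency is dominated by the one slow turn, which is the reason to budget per-percentile rather than on the mean.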

The two rates that matter most — false-interruption and missed-turn — are a tradeoff curve, not independent dials. Tighten endpointing to kill dead air and false interruptions climb; loosen it to stop interrupting and lag climbs. You are choosing a point on a curve, and you cannot choose it well without measuring both ends. Plot them together. A voice product lives or dies on where it sits on that curve.
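For a pure silence-timer endpointer, the curve can be traced from a set of labeled mid-turn pauses. The sketch below assumes the timer fires on any pause at least as long as its threshold, and that every genuine end-of-turn then waits out the full threshold as dead air; the pause durations are made-up sample data.

```python
from typing import Dict, List


def sweep(mid_turn_pauses_ms: List[int], thresholds_ms: List[int]) -> List[Dict[str, float]]:
    """For each silence threshold, report both ends of the tradeoff."""
    curve = []
    for threshold in thresholds_ms:
        # Pauses the caller would have resumed from, but the timer fired first.
        fired = sum(p >= threshold for p in mid_turn_pauses_ms)
        curve.append({
            "threshold_ms": threshold,
            "false_interruption_rate": fired / len(mid_turn_pauses_ms),
            "added_dead_air_ms": threshold,  # paid on every genuine end-of-turn
        })
    return curve


# Hypothetical durations of pauses that occurred inside a caller's turn.
pauses = [120, 180, 250, 300, 420, 650, 900, 150]
curve = sweep(pauses, [200, 400, 800])
```

On this sample, moving the threshold from 200ms to 800ms drops the false-interruption rate from 0.625 to 0.125 while the dead air per turn rises from 200ms to 800ms. A semantic turn detector is an attempt to move the whole curve rather than slide along it.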

The synthesis

Turn-taking is three timing decisions — when the caller is done, when to yield to an interruption, when a sound is a backchannel and not a turn — and a voice agent has to make all three on every exchange, in real time, against a human conversational clock that runs near 200ms. Transcription being solved does not help with any of them. The leverage is the same in each case: stop deciding on silence acoustics alone and give the agent a signal about meaning, whether through a semantic turn-detector beside the VAD or a speech-to-text model that does turn detection itself. Then instrument the false-interruption and missed-turn rates as the tradeoff curve they are, and pick your point on it deliberately. A voice agent that transcribes perfectly and gets turn-taking wrong is, to the caller, a broken product. The words were never the hard part.

Reading list

LiveKit’s improved end-of-turn detection model — the semantic turn-detector that replaces a raw silence timer, and the measured ~39% cut in false-positive interruptions.
Pipecat’s Smart Turn v3 announcement — an open, audio-native semantic VAD; weights, data, and training scripts published.
Deepgram’s introduction of Flux — the case for folding transcription and turn detection into one conversational model.
Stivers et al., universals and cultural variation in turn-taking — the cross-linguistic evidence for how fast and how tightly humans take turns; the bar your agent is judged against.

The transcript was the easy 80%. Turn-taking is the 20% the caller actually hears.
