Breaking Down the Latency Budget
A voice AI pipeline has three sequential stages: speech recognition (STT), language model inference (LLM), and speech synthesis (TTS). Each stage contributes latency, and they cannot be parallelized in the naive case. Understanding where the time goes is the first step.
- STT: Transcribing 5 seconds of speech with a server-side Whisper model — typically 200–600ms depending on model size and hardware
- LLM first-token latency: Time from receiving the transcription to the first output token — typically 300–800ms on a well-served 70B model
- TTS: Converting the LLM response to audio — the biggest variable, ranging from 200ms (streaming, first audio chunk) to 2000ms+ (non-streaming, full synthesis before playback)
- Network round trips: 2–4 round trips, each adding 20–80ms depending on infrastructure
The naive pipeline (transcribe fully, then generate text fully, then synthesize audio fully) blows the 2-second budget: non-streaming TTS alone can take 2000ms or more, before any LLM generation time is counted. The solution is streaming at every stage: start the next stage the moment there is enough output from the previous one.
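The arithmetic behind that claim can be checked with a few lines of Python. This is a rough sketch that only sums the ranges quoted in the bullets above; the constant names are made up for illustration, and it makes two simplifying assumptions: streaming TTS starts at the first LLM token (in practice it starts at the first sentence), and the naive LLM stage costs no more than its first-token latency, which makes the naive total a lower bound.

```python
BUDGET_MS = 2000

# (low, high) latency per stage in milliseconds, taken from the bullets above.
STT = (200, 600)
LLM_TTFT = (300, 800)
TTS_STREAMING_FIRST_CHUNK = (200, 200)
TTS_NON_STREAMING = (2000, 2000)   # quoted as "2000ms+", so treat it as a floor
NETWORK = (2 * 20, 4 * 80)         # 2 to 4 round trips at 20 to 80 ms each


def total(*stages):
    """Sum the low and high ends of several (low, high) ranges."""
    return (sum(s[0] for s in stages), sum(s[1] for s in stages))


# Streaming path: time until the first audio chunk can play.
streaming = total(STT, LLM_TTFT, TTS_STREAMING_FIRST_CHUNK, NETWORK)

# Naive path: best case only, since full LLM generation takes longer than TTFT.
naive_floor = total(STT, LLM_TTFT, TTS_NON_STREAMING, NETWORK)[0]

print(f"streaming first audio: {streaming[0]}-{streaming[1]} ms")
print(f"naive pipeline, best case: {naive_floor} ms (budget {BUDGET_MS} ms)")
```

Under these assumptions the streaming path stays under the budget even when every stage hits the top of its range, while the naive path misses it in the best case.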
The WebSocket Architecture
HTTP request-response is the wrong transport for a pipeline with streaming at every stage. WebSockets give you a persistent bidirectional channel that eliminates the connection overhead of repeated HTTP requests and allows the server to push audio chunks as they are synthesized.
    # FastAPI WebSocket handler (simplified).
    # transcribe_streaming, llm_client, and tts_stream are application helpers not shown here.
    from fastapi import FastAPI, WebSocket

    app = FastAPI()

    @app.websocket("/voice")
    async def voice_pipeline(websocket: WebSocket):
        await websocket.accept()
        # iter_bytes() yields binary frames and stops when the client disconnects
        async for message in websocket.iter_bytes():
            # 1. STT: feed the audio chunk; only act once the transcript is final
            transcript = await transcribe_streaming(message)
            if not transcript.is_final:
                continue
            # 2. LLM: stream tokens as they arrive
            token_stream = llm_client.stream(transcript.text)
            # 3. TTS: convert token batches to audio and stream back
            async for audio_chunk in tts_stream(token_stream):
                await websocket.send_bytes(audio_chunk)

Streaming Whisper for STT
Whisper's standard API takes a complete audio file and returns a complete transcription. For low-latency voice, you want to start getting text back before the user finishes speaking — but you also need to avoid sending incomplete sentences to the LLM.
The compromise is Voice Activity Detection (VAD) combined with end-of-utterance detection. We use Silero VAD to detect when the user has stopped speaking (typically 500ms of silence after speech), then send the buffered audio to Whisper in a single pass. Whisper still transcribes the utterance as a whole, with full context, while the VAD provides responsive end-of-turn detection.
- faster-whisper: Whisper reimplemented on the CTranslate2 backend; the project reports up to 4x faster inference than the reference implementation at the same accuracy, with lower memory use
- Whisper large-v3 for accuracy, medium for lower latency — test both against your use case
- Silero VAD: lightweight, runs on CPU, adds ~5ms overhead for end-of-utterance detection
- Audio streaming: send 100ms audio chunks from the client; buffer on server until VAD detects end-of-turn
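The buffering logic described above (100ms chunks from the client, end-of-turn after 500ms of silence) can be sketched as a small state machine. This is an illustrative sketch, not the production code: `UtteranceBuffer` is a hypothetical name, and `is_speech` is a placeholder callable standing in for whatever VAD you use. It does not reproduce Silero VAD's actual API.

```python
from typing import Callable, List, Optional

CHUNK_MS = 100     # client sends 100 ms audio chunks
SILENCE_MS = 500   # silence after speech that ends the turn


class UtteranceBuffer:
    """Buffers audio chunks and returns the whole utterance at end-of-turn."""

    def __init__(self, is_speech: Callable[[bytes], bool],
                 chunk_ms: int = CHUNK_MS, silence_ms: int = SILENCE_MS):
        self.is_speech = is_speech      # placeholder for a real VAD call
        self.chunk_ms = chunk_ms
        self.silence_ms = silence_ms
        self._chunks: List[bytes] = []
        self._heard_speech = False
        self._silent_ms = 0

    def push(self, chunk: bytes) -> Optional[bytes]:
        """Add one chunk; return the buffered utterance once the turn ends."""
        if self.is_speech(chunk):
            self._heard_speech = True
            self._silent_ms = 0
        elif self._heard_speech:
            self._silent_ms += self.chunk_ms
        else:
            return None                 # leading silence: nothing to keep yet
        self._chunks.append(chunk)
        if self._silent_ms >= self.silence_ms:
            audio = b"".join(self._chunks)
            self._chunks, self._heard_speech, self._silent_ms = [], False, 0
            return audio                # hand this to the transcriber
        return None
```

In the WebSocket handler, each incoming message would be passed to `push()`, and transcription would run only when it returns a non-`None` buffer.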
LLM Token Streaming
The LLM's time-to-first-token (TTFT) is the dominant latency contributor when the response is short. For voice, where answers are often 1–3 sentences, TTFT is almost the entire LLM contribution.
vLLM with continuous batching keeps TTFT low under concurrent load, which makes it the better fit at scale. For single-user or low-concurrency deployments, Ollama with a properly quantized model is simpler to deploy and can deliver sub-400ms TTFT for 7B–13B models on consumer hardware.
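Whichever server you choose, it helps to measure TTFT the same way for each candidate. The sketch below times any iterable of tokens; `measure_ttft` and `fake_llm_stream` are hypothetical names, and the fake stream uses `time.sleep` in place of a real streaming client, so the numbers it produces mean nothing beyond demonstrating the measurement.

```python
import time
from typing import Iterable, List, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], float, List[str]]:
    """Consume a token stream and return (ttft_ms, total_ms, tokens)."""
    start = time.perf_counter()
    ttft_ms = None
    tokens = []
    for token in stream:
        if ttft_ms is None:
            # First token has arrived: this is the time-to-first-token.
            ttft_ms = (time.perf_counter() - start) * 1000
        tokens.append(token)
    total_ms = (time.perf_counter() - start) * 1000
    return ttft_ms, total_ms, tokens


def fake_llm_stream():
    """Stand-in for a real streaming client (assumption, not a real API)."""
    time.sleep(0.03)            # simulated prompt processing before first token
    yield "Hello"
    for token in [",", " world", "."]:
        time.sleep(0.005)       # simulated per-token decode time
        yield token


ttft_ms, total_ms, tokens = measure_ttft(fake_llm_stream())
print(f"TTFT {ttft_ms:.0f} ms, total {total_ms:.0f} ms, {len(tokens)} tokens")
```

To use it against a real server, replace `fake_llm_stream()` with the token iterator your client library returns, and create that iterator as close to the `measure_ttft` call as possible so request setup is included in the timing.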
The key streaming optimization: do not wait for the full LLM response before starting TTS. As soon as you have a sentence boundary (period, question mark, or exclamation point), send that sentence to TTS and start synthesizing audio while the LLM continues generating the rest.
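A minimal sketch of that sentence-boundary chunking, assuming tokens arrive as plain strings. The function name `sentence_chunks` is made up for illustration. It treats a period, question mark, or exclamation point as a boundary only when whitespace follows, so that a decimal such as "$3.50" is not split; the cost is that each sentence is released one token later than strictly necessary.

```python
import re
from typing import Iterable, Iterator

# Sentence-ending punctuation followed by whitespace.
_BOUNDARY = re.compile(r"[.!?]\s")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group a stream of LLM tokens into complete sentences for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            end = match.end()
            yield buffer[:end].strip()   # complete sentence, ready for TTS
            buffer = buffer[end:]
    if buffer.strip():
        yield buffer.strip()             # flush whatever remains at end of stream


tokens = ["Sure", ".", " It", " costs", " $3", ".", "50", ".", " Anything", " else", "?"]
for sentence in sentence_chunks(tokens):
    print(sentence)
```

A production version would also need to handle abbreviations ("Dr. Smith") and very long sentences, for example by forcing a flush after a maximum number of characters.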
Low-Latency TTS
Text-to-speech has historically been the biggest latency sink in voice pipelines. Traditional TTS models process the entire text and return a complete audio file. Modern streaming TTS models return the first audio chunks within 100–300ms of receiving the first sentence.
- ElevenLabs (streaming mode): Best voice quality, sub-400ms first-audio-chunk latency via streaming API
- Kokoro: Open-source, runs locally, surprisingly good quality for an on-prem option, ~200ms first chunk
- Coqui XTTS v2: Good for voice cloning use cases, slightly higher latency than Kokoro
- F5-TTS: Fast, lightweight, zero-shot voice cloning — excellent for applications requiring custom voices
For the voice AI product we built, we used ElevenLabs in streaming mode with sentence-boundary chunking from the LLM. The first audio chunk arrives at the client before the LLM has finished generating the complete response.
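The overlap between LLM generation and synthesis can be illustrated with a producer/consumer sketch built on `asyncio.Queue`. Everything here is a stand-in: `fake_llm` and `fake_tts` simulate latency with `asyncio.sleep` and do not call ElevenLabs or any real model, and the sleep durations are arbitrary.

```python
import asyncio


async def fake_llm(events):
    """Stand-in LLM: emits one sentence every 20 ms."""
    for sentence in ["First sentence.", "Second sentence."]:
        await asyncio.sleep(0.02)
        events.append(f"llm:{sentence}")
        yield sentence
    events.append("llm:done")


async def fake_tts(sentence, events):
    """Stand-in TTS: 'synthesizes' a sentence in 5 ms."""
    await asyncio.sleep(0.005)
    events.append(f"audio:{sentence}")
    return sentence.encode()


async def run_pipeline():
    events = []
    queue = asyncio.Queue()

    async def produce():
        # Push each sentence to TTS as soon as the LLM finishes it.
        async for sentence in fake_llm(events):
            await queue.put(sentence)
        await queue.put(None)           # sentinel: no more sentences

    async def consume():
        # Synthesize sentences while the LLM keeps generating.
        while (sentence := await queue.get()) is not None:
            await fake_tts(sentence, events)

    await asyncio.gather(produce(), consume())
    return events


if __name__ == "__main__":
    for event in asyncio.run(run_pipeline()):
        print(event)
```

In the recorded event order, the audio for the first sentence appears before the LLM reports that it is done, which is the property the pipeline relies on.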
Latency Results
Measured on the production deployment (5-second average user utterance, 2-sentence average LLM response):
- STT (faster-whisper large-v3 on A10G): 180ms median, 290ms p95
- LLM TTFT (Llama 3.1 70B on vLLM, 2x A100): 310ms median, 480ms p95
- TTS first audio chunk (ElevenLabs streaming): 220ms median, 340ms p95
- Total pipeline (user stops talking → first audio plays): 720ms median, 1.1s p95
Both the median and the p95 are well within the 2-second target, with headroom for network variance.
What We Learned
The biggest lesson from building this pipeline: latency adds up. The stages run in sequence, so a 100ms regression in any one of them is a 100ms regression in user experience. Instrument every stage with percentile latency tracking from day one; do not wait for a complaint.
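A minimal sketch of per-stage percentile tracking, using only the standard library. `LatencyTracker` is a hypothetical helper, not the instrumentation used in the deployment described here; it keeps samples in memory and uses the nearest-rank method for percentiles, where a production system would more likely export to a metrics backend.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyTracker:
    """Collects per-stage latency samples and reports percentiles."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    @contextmanager
    def stage(self, name):
        """Time the body of a `with` block and record it under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples_ms[name].append((time.perf_counter() - start) * 1000)

    def record(self, name, ms):
        """Record a latency measured elsewhere, in milliseconds."""
        self.samples_ms[name].append(ms)

    def percentile(self, name, pct):
        """Nearest-rank percentile (pct in 1..100) for one stage."""
        data = sorted(self.samples_ms[name])
        if not data:
            raise ValueError(f"no samples for stage {name!r}")
        rank = max(1, math.ceil(pct * len(data) / 100))
        return data[rank - 1]


tracker = LatencyTracker()
with tracker.stage("stt"):
    time.sleep(0.01)                    # placeholder for the real STT call
print(f"stt p50: {tracker.percentile('stt', 50):.1f} ms")
```

Wrapping each of the STT, LLM, and TTS calls in `tracker.stage(...)` gives the median and p95 per stage, in the same form as the results reported above.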
Also: audio quality and response quality matter more than speed past a certain point. A 1.5-second response with excellent voice quality and a helpful answer is better than a 0.8-second response with robotic audio and a wrong answer. Optimize for the 2-second target, then invest the remaining effort in quality.