Sub-2-Second Voice AI: How to Build a Real-Time STT → LLM → TTS Pipeline

Voice interfaces feel natural when they respond in under 2 seconds. Above 2 seconds, users start to wonder if the system is broken. Building a voice AI pipeline that consistently hits this target requires treating latency as a first-class constraint at every stage — not something you optimize after the pipeline works.

Breaking Down the Latency Budget

A voice AI pipeline has three sequential stages: speech recognition (STT), language model inference (LLM), and speech synthesis (TTS). Each stage adds latency, and in the naive design each one waits for the previous stage to finish, so the delays add up. Understanding where the time goes is the first step.

STT: Transcribing 5 seconds of speech with a server-side Whisper model — typically 200–600ms depending on model size and hardware
LLM first-token latency: Time from receiving the transcription to the first output token — typically 300–800ms on a well-served 70B model
TTS: Converting the LLM response to audio — the biggest variable, ranging from 200ms (streaming, first audio chunk) to 2000ms+ (non-streaming, full synthesis before playback)
Network round trips: 2–4 round trips, each adding 20–80ms depending on infrastructure

The naive pipeline (transcribe fully, then generate text fully, then synthesize audio fully) blows the 2-second budget: non-streaming TTS alone can take 2000ms or more, on top of the time the LLM needs to generate its complete response. The solution is streaming at every stage: start the next stage the moment there is enough output from the previous one.
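To make the budget concrete, here is a rough back-of-the-envelope sketch comparing the two designs. It uses midpoints of the ranges listed above; the LLM full-generation time (1500ms) is not given in this article and is purely an illustrative assumption, as are the `StageLatency` type and the helper names.

```python
from dataclasses import dataclass


@dataclass
class StageLatency:
    """Latency of one pipeline stage, in milliseconds."""
    first_output_ms: float  # time until the stage emits its first usable output
    full_output_ms: float   # time until the stage has emitted everything


def naive_latency_ms(stages, network_ms=0.0):
    """Each stage waits for the previous one to finish completely."""
    return sum(s.full_output_ms for s in stages) + network_ms


def streaming_latency_ms(stages, network_ms=0.0):
    """Each stage starts as soon as the previous one emits its first output.

    Simplification: this treats the first LLM token as enough for TTS to start.
    In practice TTS usually waits for the first full sentence.
    """
    return sum(s.first_output_ms for s in stages) + network_ms


# Midpoints of the ranges quoted above.
stt = StageLatency(first_output_ms=400, full_output_ms=400)
llm = StageLatency(first_output_ms=550, full_output_ms=1500)  # 1500 ms is an assumption
tts = StageLatency(first_output_ms=200, full_output_ms=2000)
network_ms = 3 * 50  # 3 round trips at 50 ms each

naive = naive_latency_ms([stt, llm, tts], network_ms)
streaming = streaming_latency_ms([stt, llm, tts], network_ms)
print(f"naive: {naive:.0f} ms, streaming: {streaming:.0f} ms")
```

With these placeholder numbers the naive design lands around 4 seconds while the streaming design stays near 1.3 seconds. The exact figures matter less than the shape: streaming replaces each stage's full duration with its time-to-first-output.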

The WebSocket Architecture

HTTP request-response is the wrong transport for a pipeline with streaming at every stage. WebSockets give you a persistent bidirectional channel that eliminates the connection overhead of repeated HTTP requests and allows the server to push audio chunks as they are synthesized.

python

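# FastAPI WebSocket handler (simplified)
# transcribe_streaming, llm_client, and tts_stream are application-specific
# helpers that are not shown here.
from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/voice")
async def voice_pipeline(websocket: WebSocket):
    await websocket.accept()

    # iter_bytes() yields binary frames until the client disconnects
    async for message in websocket.iter_bytes():
        # 1. STT: feed the audio chunk to the transcriber; skip partial results
        transcript = await transcribe_streaming(message)
        if not transcript.is_final:
            continue

        # 2. LLM: stream tokens as they arrive
        token_stream = llm_client.stream(transcript.text)

        # 3. TTS: convert token batches to audio and stream back
        async for audio_chunk in tts_stream(token_stream):
            await websocket.send_bytes(audio_chunk)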

Streaming Whisper for STT

Whisper's standard API takes a complete audio file and returns a complete transcription. For low-latency voice, you want to start getting text back before the user finishes speaking — but you also need to avoid sending incomplete sentences to the LLM.

The technique is Voice Activity Detection (VAD) plus end-of-utterance detection. We use Silero VAD to detect when the user has stopped speaking (typically 500ms of silence after speech), then send the buffered audio to Whisper for transcription. Whisper still sees the whole utterance at once, which preserves its accuracy, while VAD provides responsive end-of-turn detection.

faster-whisper: Whisper on a CTranslate2 backend; the project reports up to 4x faster inference than the reference implementation at the same accuracy, but benchmark on your own hardware
Whisper large-v3 for accuracy, medium for lower latency — test both against your use case
Silero VAD: lightweight, runs on CPU, adds ~5ms of processing overhead for end-of-utterance detection (the silence window that triggers end-of-turn is a separate, tunable wait)
Audio streaming: send 100ms audio chunks from the client; buffer on server until VAD detects end-of-turn
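The server-side buffering described above can be sketched roughly as follows. This is a minimal illustration, not the production code: the `UtteranceBuffer` class is hypothetical, the audio format (16 kHz, 16-bit mono PCM) is an assumption, and the per-chunk `is_speech` flag stands in for the verdict of whatever VAD is in use (Silero VAD in this pipeline).

```python
SAMPLE_RATE_HZ = 16_000  # assumed: 16 kHz mono, 16-bit PCM
BYTES_PER_SAMPLE = 2
CHUNK_MS = 100
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # bytes per 100 ms chunk


class UtteranceBuffer:
    """Accumulates audio chunks and reports when the speaker's turn has ended."""

    def __init__(self, silence_ms=500, chunk_ms=CHUNK_MS):
        self.silence_chunks_needed = silence_ms // chunk_ms
        self._chunks = []
        self._heard_speech = False
        self._trailing_silence = 0

    def push(self, chunk: bytes, is_speech: bool):
        """Add one chunk. Returns the whole utterance at end-of-turn, else None.

        `is_speech` is the per-chunk verdict from the VAD.
        """
        if is_speech:
            self._heard_speech = True
            self._trailing_silence = 0
        elif self._heard_speech:
            self._trailing_silence += 1
        else:
            return None  # leading silence: nothing worth buffering yet

        self._chunks.append(chunk)
        if self._trailing_silence >= self.silence_chunks_needed:
            utterance = b"".join(self._chunks)
            # Reset so the same buffer can be reused for the next turn.
            self._chunks = []
            self._heard_speech = False
            self._trailing_silence = 0
            return utterance
        return None
```

When `push` returns bytes, that buffer is what gets handed to Whisper. Lowering `silence_ms` makes the assistant respond sooner but raises the risk of cutting the user off mid-thought, so it is worth tuning against real conversations.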

LLM Token Streaming

The LLM's time-to-first-token (TTFT) is the dominant latency contributor when the response is short. For voice, where answers are often 1–3 sentences, TTFT is almost the entire LLM contribution.

vLLM with continuous batching is a strong choice for keeping TTFT low under concurrent load. For single-user or low-concurrency deployments, Ollama with a quantized model is simpler to deploy and can deliver sub-400ms TTFT for 7B–13B models on consumer hardware.
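Whichever server you pick, it helps to measure TTFT yourself rather than rely on published numbers. The helper below is a small sketch that times any iterable of tokens; `measure_ttft` is a hypothetical name, and it assumes your client exposes the streamed response as a plain (synchronous) Python iterator. An async client would need an `async for` variant.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a token stream and return (ttft_seconds, total_seconds, full_text)."""
    start = clock()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = clock() - start  # first token has just arrived
        parts.append(token)
    total = clock() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, "".join(parts)


def simulated_stream():
    """Stand-in for a real streaming response, with artificial delays."""
    time.sleep(0.05)  # pretend prefill time
    yield "Hello"
    time.sleep(0.01)
    yield " there."


ttft_s, total_s, text = measure_ttft(simulated_stream())
print(f"TTFT {ttft_s * 1000:.0f} ms, total {total_s * 1000:.0f} ms, text={text!r}")
```

Start the clock at the moment the transcript is ready, not when the HTTP request is sent from inside the client library, so that queueing and network time are included in the figure.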

Tip: For voice AI specifically, smaller and faster often beats larger and slower. A well-prompted Llama 3.2 3B can answer conversational queries more responsively than a 70B model with double the TTFT.

The key streaming optimization: do not wait for the full LLM response before starting TTS. As soon as you have a sentence boundary (period, question mark, or exclamation point), send that sentence to TTS and start synthesizing audio while the LLM continues generating the rest.
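A minimal sketch of sentence-boundary chunking over an async token stream follows. The function name and the regex are illustrative assumptions. The sketch treats terminal punctuation followed by whitespace as a boundary, which avoids splitting decimals such as "3.5" at the cost of waiting one extra token; abbreviations like "Dr." will still split early, so a production version would likely need a smarter rule.

```python
import asyncio
import re
from typing import AsyncIterator

# Terminal punctuation followed by whitespace marks the end of a sentence.
_BOUNDARY = re.compile(r"[.?!]+\s")


async def sentences_from_tokens(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentences, yielding each as soon as it is complete."""
    buffer = ""
    async for token in tokens:
        buffer += token
        # Emit every complete sentence currently sitting in the buffer.
        while (match := _BOUNDARY.search(buffer)) is not None:
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever remains when the LLM stops generating


async def demo_tokens():
    """Stand-in for an LLM token stream."""
    for token in ["Sure", ".", " It", " costs", " 3", ".", "5", " dollars", "!",
                  " Anything", " else", "?"]:
        yield token


async def collect_demo():
    return [s async for s in sentences_from_tokens(demo_tokens())]


print(asyncio.run(collect_demo()))
```

Each yielded sentence can be sent to TTS immediately, so synthesis of the first sentence overlaps with generation of the second.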

Low-Latency TTS

Text-to-speech is historically the biggest latency sink in voice pipelines. Traditional TTS models process the entire text and return a complete audio file. Modern streaming TTS models return audio chunks within 100–300ms of receiving the first sentence.

ElevenLabs (streaming mode): Best voice quality, sub-400ms first-audio-chunk latency via streaming API
Kokoro: Open-source, runs locally, surprisingly good quality for an on-prem option, ~200ms first chunk
Coqui XTTS v2: Good for voice cloning use cases, slightly higher latency than Kokoro
F5-TTS: Fast, lightweight, zero-shot voice cloning — excellent for applications requiring custom voices

For the voice AI product we built, we used ElevenLabs in streaming mode with sentence-boundary chunking from the LLM. The first audio chunk arrives at the client before the LLM has finished generating the complete response.
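The overlap between generation and synthesis can be illustrated with a producer/consumer sketch built on `asyncio.Queue`. Everything here is a stand-in: `fake_llm_sentences` and `fake_tts` simulate the LLM and the TTS service with short sleeps, and `run_pipeline` is a hypothetical name. It is not the ElevenLabs API or the production code, only the concurrency pattern.

```python
import asyncio


async def fake_llm_sentences(log):
    """Stand-in for an LLM emitting complete sentences one at a time."""
    for sentence in ["Your order shipped today.", "It should arrive on Friday."]:
        await asyncio.sleep(0.05)  # simulated generation time per sentence
        log.append(f"llm: {sentence}")
        yield sentence
    log.append("llm: done")


async def fake_tts(sentence):
    """Stand-in for a streaming TTS call; yields placeholder audio bytes."""
    await asyncio.sleep(0.01)  # simulated time to first audio chunk
    yield sentence.encode("utf-8")


async def run_pipeline(send_audio, log):
    """Run LLM and TTS concurrently, connected by a queue of sentences."""
    queue = asyncio.Queue()

    async def produce():
        async for sentence in fake_llm_sentences(log):
            await queue.put(sentence)
        await queue.put(None)  # sentinel: no more sentences

    async def consume():
        while (sentence := await queue.get()) is not None:
            async for chunk in fake_tts(sentence):
                log.append("audio: chunk sent")
                await send_audio(chunk)

    await asyncio.gather(produce(), consume())


async def main():
    log, sent = [], []

    async def send_audio(chunk):
        sent.append(chunk)  # a real handler would call websocket.send_bytes(chunk)

    await run_pipeline(send_audio, log)
    return log, sent


log, sent = asyncio.run(main())
print("\n".join(log))
```

In the printed log, the first "audio: chunk sent" entry appears before "llm: done", which is the behavior described above: playback begins while the model is still generating.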

Latency Results

Measured on the production deployment (5-second average user utterance, 2-sentence average LLM response):

STT (faster-whisper large-v3 on A10G): 180ms median, 290ms p95
LLM TTFT (Llama 3.1 70B on vLLM, 2x A100): 310ms median, 480ms p95
TTS first audio chunk (ElevenLabs streaming): 220ms median, 340ms p95
Total pipeline (user stops talking → first audio plays): 720ms median, 1.1s p95
Well within the 2-second target, with headroom for network variance

Note: The 720ms median end-to-end latency means the voice assistant feels genuinely conversational rather than like a system processing a request. This is the threshold where user testing shows a meaningful jump in satisfaction scores.

What We Learned

The biggest lesson from building this pipeline: latency adds up. The stages run in sequence, so a 100ms regression in any one of them is a 100ms regression in user experience. Instrument every stage with percentile latency tracking from day one; do not wait for a complaint.
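Per-stage percentile tracking does not need heavy tooling to get started. The sketch below uses only the standard library; `StageLatencyTracker` and the stage names are hypothetical, and a real deployment would more likely export these samples to a metrics system than keep them in memory.

```python
import statistics
import time
from collections import defaultdict
from contextlib import contextmanager


class StageLatencyTracker:
    """Collects latency samples per pipeline stage and reports median and p95."""

    def __init__(self):
        self._samples_ms = defaultdict(list)

    def record(self, stage, elapsed_ms):
        self._samples_ms[stage].append(elapsed_ms)

    @contextmanager
    def measure(self, stage):
        """Time the enclosed block and record it under `stage`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(stage, (time.perf_counter() - start) * 1000)

    def summary(self, stage):
        samples = self._samples_ms[stage]
        if len(samples) < 2:
            raise ValueError(f"need at least 2 samples for stage {stage!r}")
        # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
        cuts = statistics.quantiles(samples, n=100, method="inclusive")
        return {
            "count": len(samples),
            "median_ms": statistics.median(samples),
            "p95_ms": cuts[94],
        }


tracker = StageLatencyTracker()
for _ in range(5):
    with tracker.measure("stt"):
        time.sleep(0.001)  # placeholder for the real transcription call
print(tracker.summary("stt"))
```

Wrapping each of the STT, LLM, and TTS calls in `tracker.measure(...)` gives the same kind of median and p95 breakdown reported in the results above, and makes a regression in any single stage visible as soon as it ships.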

Also: audio quality and response quality matter more than speed past a certain point. A 1.5-second response with excellent voice quality and a helpful answer is better than a 0.8-second response with robotic audio and a wrong answer. Optimize for the 2-second target, then invest the remaining effort in quality.