LLM Engineering · Web
May 5, 2026 · 9 min read

Implementing LLM Streaming in Production Web Apps: SSE, WebSockets, and the Edge Cases That Break Everything

Moneeb Abbas

AI Systems Architect

Streaming is the difference between a UI that feels alive and one that feels broken. Without streaming, users stare at a blank area waiting for the full response — a wait that feels much longer than it is. With streaming, the first token appears almost immediately and text flows in naturally. The implementation is not complicated, but the production edge cases are.

SSE vs WebSockets: Choosing the Right Transport

Most LLM streaming use cases only require one-way communication from server to client: the server streams tokens as they are generated. Two transports handle this, with meaningfully different tradeoffs:

  • Server-Sent Events (SSE): HTTP-based, one-directional (server to client), and supported natively in all modern browsers through the EventSource API, which also reconnects automatically. It runs over ordinary HTTP with no protocol upgrade, although buffering proxies still have to be configured to pass the stream through. The right choice for most LLM streaming use cases.
  • WebSockets: Full-duplex, persistent bidirectional connection, lower overhead for high-frequency messages, requires special proxy configuration (sticky sessions or WebSocket-aware load balancer). The right choice when the client also needs to stream data to the server: voice audio, real-time collaborative editing, or tool approval workflows.
Tip: Default to SSE unless you have a specific need for client-to-server streaming. SSE is simpler to implement and simpler to deploy. Automatic reconnection comes free with the browser's EventSource API, but EventSource only issues GET requests; a fetch-based client that POSTs the prompt, like the one in this post, has to implement its own retry logic.

SSE Implementation: Server Side

An SSE endpoint is a long-lived HTTP response with Content-Type: text/event-stream. Each token from the LLM is written to the response as it arrives:

python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str

# llm_client is assumed to be an async, OpenAI-style streaming client
# that is configured elsewhere in the application.
async def token_stream(prompt: str):
    async for chunk in llm_client.astream(prompt):
        token = chunk.choices[0].delta.content or ""
        if token:
            # SSE format: "data: <payload>\n\n"
            yield f"data: {json.dumps({'token': token})}\n\n"
        await asyncio.sleep(0)  # yield control to event loop

    # Sentinel event so the client knows the stream ended normally
    yield "data: [DONE]\n\n"

@app.post("/stream")
async def stream_response(request: PromptRequest):
    return StreamingResponse(
        token_stream(request.prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable Nginx buffering
        },
    )
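
The "data: <payload>" framing used above can be pulled into a small helper. The following is a minimal sketch, not part of the original endpoint: the name format_sse and the llm_error event name are hypothetical. It JSON-encodes non-string payloads, optionally adds an event: field so clients can tell event types apart, and splits payloads containing newlines into several data: lines, as the SSE wire format requires.

```python
import json
from typing import Any, Optional


def format_sse(data: Any, event: Optional[str] = None) -> str:
    """Serialize one SSE event (hypothetical helper for illustration).

    Strings are sent as-is; any other value is JSON-encoded first.
    """
    payload = data if isinstance(data, str) else json.dumps(data)
    lines = []
    if event is not None:
        # Named events let the client distinguish tokens from errors
        lines.append(f"event: {event}")
    # A payload containing newlines must be sent as multiple data: lines
    lines.extend(f"data: {part}" for part in payload.split("\n"))
    # A blank line terminates the event
    return "\n".join(lines) + "\n\n"


token_event = format_sse({"token": "Hello"})
error_event = format_sse({"error": "upstream timeout"}, event="llm_error")
done_event = format_sse("[DONE]")
```

With a helper like this, each yield in the generator becomes yield format_sse({"token": token}), and an error can be sent as its own event type instead of sharing the token payload shape.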

SSE Implementation: Client Side

typescript
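async function streamCompletion(prompt: string, onToken: (t: string) => void) {
  const response = await fetch("/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });

  if (!response.ok || !response.body) {
    throw new Error(`Stream request failed with status ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = ""; // holds text that has not yet formed a complete line

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // A read() can end mid-event, so accumulate and keep the trailing partial line
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = line.slice(6);
      if (data === "[DONE]") return;

      const event = JSON.parse(data);
      if (event.error) throw new Error(event.error); // error sent mid-stream
      if (event.token) onToken(event.token); // update UI
    }
  }
}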
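
An event can arrive split across several reads, and a multi-byte character can be split too, so the parsing step is worth isolating and unit-testing. The sketch below shows that buffering logic in Python, for example for a backend consumer or a test harness. The class name SSEParser is hypothetical, and it assumes the server terminates lines with "\n" only, as the endpoints in this post do; the full SSE format also permits "\r\n" and "\r".

```python
import codecs
from typing import Iterator, List, Tuple


class SSEParser:
    """Incrementally parse an SSE byte stream (illustrative sketch)."""

    def __init__(self) -> None:
        # Incremental decoder keeps partial UTF-8 sequences between chunks
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, chunk: bytes) -> Iterator[Tuple[str, str]]:
        """Yield (event_type, data) for each event completed by this chunk."""
        self._buffer += self._decoder.decode(chunk)
        # A blank line ends an event; anything after it stays buffered
        while "\n\n" in self._buffer:
            raw, self._buffer = self._buffer.split("\n\n", 1)
            event_type = "message"  # default type when no event: field is sent
            data_lines: List[str] = []
            for line in raw.split("\n"):
                if line.startswith("event:"):
                    event_type = line[len("event:"):].strip()
                elif line.startswith("data:"):
                    value = line[len("data:"):]
                    # Only a single leading space is part of the framing
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:
                yield event_type, "\n".join(data_lines)


parser = SSEParser()
first = list(parser.feed(b'data: {"tok'))         # incomplete event, nothing yet
second = list(parser.feed(b'en": "Hi"}\n\ndata'))  # completes the first event
```

Here first is empty because the event has not been terminated yet, and second contains the single completed event while the trailing "data" fragment waits in the buffer for the next chunk.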

The Production Edge Cases

The happy path implementation above works in development. These are the edge cases that appear under production load:

  • Proxy and load balancer buffering: Nginx buffers proxied responses by default, and other load balancers and CDNs in the request path may buffer or compress them too. A buffered SSE response delivers all tokens at once at the end, which defeats the purpose. Set X-Accel-Buffering: no for Nginx, and verify that every hop between your server and the browser (AWS ALB, Cloudflare, or whichever CDN you use) passes streaming responses through.
  • Client disconnection mid-stream: The user closes the tab while the model is generating. Without handling this, the server continues generating and billing for tokens nobody will read. Detect client disconnection and cancel the upstream LLM request.
  • Partial JSON in chunks: The streaming transport may split an event across multiple read() calls. Always buffer incoming text and parse only complete lines; never treat a read boundary as an event boundary.
  • Error mid-stream: The LLM API returns an error after streaming has started. The HTTP status code is sent with the headers, so you cannot change it once the response body has started. Communicate errors through the SSE event stream itself using a typed error event.
  • Render rate limiting: Updating the DOM on every token at 100+ tokens/second can cause excessive re-renders. Batch tokens and flush them to the UI every 50–100 ms for smooth rendering without jank.
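
The batching advice in the last bullet targets the display layer, but the coalescing logic itself does not depend on where it runs. The following is a minimal sketch, assuming a hypothetical helper named batch_tokens that wraps any async token source and emits at most one joined batch per interval. The clock parameter is an assumption added only so the timing can be tested deterministically.

```python
import time
from typing import AsyncIterator, Callable, List


async def batch_tokens(
    tokens: AsyncIterator[str],
    interval: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
) -> AsyncIterator[str]:
    """Coalesce tokens into batches, flushing at most once per `interval` seconds."""
    pending: List[str] = []
    last_flush = clock()
    async for token in tokens:
        pending.append(token)
        # Flush only when enough time has passed since the previous flush
        if clock() - last_flush >= interval:
            yield "".join(pending)
            pending.clear()
            last_flush = clock()
    # Emit whatever is left once the source is exhausted
    if pending:
        yield "".join(pending)
```

This version only checks the clock when a new token arrives, so text can sit in the batch while the model stalls. A display-side implementation would typically pair the same buffer with a timer so the tail is flushed even when no further token shows up.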
python
# Detect client disconnection and cancel LLM request
import json

from starlette.requests import Request

@app.post("/stream")
async def stream_response(request: Request, body: PromptRequest):
    async def generate():
        try:
            async for chunk in llm_client.astream(body.prompt):
                # Stop consuming the LLM stream once the client is gone
                if await request.is_disconnected():
                    return
                token = chunk.choices[0].delta.content or ""
                if token:
                    yield f"data: {json.dumps({'token': token})}\n\n"
        except Exception as e:
            # The status code is already sent, so surface errors through the stream
            yield f"data: {json.dumps({'error': str(e)})}\n\n"
        # Deliberately not in a `finally` block: yielding while the generator
        # is being closed after a disconnect raises RuntimeError.
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
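
Leaving the loop stops reading from the LLM stream, but it does not by itself close that stream. A minimal sketch of making the cancellation explicit follows. The name relay_until_disconnect is hypothetical, and it assumes the upstream is an async generator exposing aclose(); SDK stream objects usually offer their own close method, which would take its place.

```python
from typing import AsyncGenerator, Awaitable, Callable


async def relay_until_disconnect(
    upstream: AsyncGenerator[str, None],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncGenerator[str, None]:
    """Forward chunks from `upstream`, closing it as soon as the client is gone."""
    try:
        async for chunk in upstream:
            if await is_disconnected():
                return
            yield chunk
    finally:
        # Runs on normal completion, early return, and cancellation alike.
        # Closing the upstream is what actually releases the provider request.
        await upstream.aclose()
```

In the endpoint above this would wrap the LLM stream, with request.is_disconnected passed as the second argument. Awaiting inside finally is safe in an async generator; only yielding there is a problem.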

Handling Streaming in React

The simplest React pattern accumulates tokens into state. The key detail is that token updates must be appended, not replaced, and the component must handle the stream lifecycle correctly:

typescript
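import { useEffect, useState } from "react";

function ChatMessage({ prompt }: { prompt: string }) {
  const [content, setContent] = useState("");
  const [streaming, setStreaming] = useState(true);

  useEffect(() => {
    let cancelled = false;
    // Start from a clean slate whenever the prompt changes
    setContent("");
    setStreaming(true);

    streamCompletion(prompt, (token) => {
      if (!cancelled) {
        // Functional update appends to the latest state, not a stale closure
        setContent((prev) => prev + token);
      }
    })
      .catch((err) => {
        if (!cancelled) console.error("Stream failed:", err);
      })
      .finally(() => {
        if (!cancelled) setStreaming(false);
      });

    // Ignore late tokens after unmount or a prompt change. This does not
    // abort the request itself; that requires passing an AbortSignal to fetch.
    return () => { cancelled = true; };
  }, [prompt]);

  return (
    <div>
      {content}
      {streaming && <span className="animate-pulse">▋</span>}
    </div>
  );
}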

Streaming with Next.js App Router

Next.js App Router supports streaming natively via React Suspense. For LLM streaming specifically, a Route Handler with a ReadableStream response is the cleanest pattern:

typescript
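// app/api/chat/route.ts
import { OpenAI } from "openai";

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

  const stream = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  const encoder = new TextEncoder();
  let cancelled = false;

  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const token = chunk.choices[0]?.delta?.content ?? "";
          if (token) {
            controller.enqueue(
              encoder.encode(`data: ${JSON.stringify({ token })}\n\n`)
            );
          }
        }
      } catch (err) {
        if (cancelled) return; // client is gone, nothing left to report
        // The status code is already sent, so report the error in-stream
        const message = err instanceof Error ? err.message : "stream error";
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ error: message })}\n\n`)
        );
      }
      if (cancelled) return;
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
      controller.close();
    },
    cancel() {
      // Called when the client disconnects: stop the upstream generation
      cancelled = true;
      stream.controller.abort();
    },
  });

  return new Response(readable, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}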
