The Three Deployment Paths
Every LLM production deployment falls into one of three categories, and the right choice is almost entirely determined by volume and compliance constraints:
- 1Cloud API (OpenAI, Anthropic, Google): Pay-per-token, zero infrastructure management, fastest time to production. Right for low-to-medium volume and no data residency requirements.
- 2Managed inference (Replicate, Together AI, Groq, AWS Bedrock): You choose the model, a third party hosts it. Better economics than cloud APIs at scale, some data residency options available, still limited operational overhead.
- 3Self-hosted (your hardware, your cloud account): Full control, best unit economics at volume, maximum compliance flexibility. Requires GPU infrastructure management and a team capable of running it.
Decision Variable 1 — Monthly Inference Volume
Volume is the primary driver. Run the following calculation before making any infrastructure decision:
- Estimate your monthly token volume: (requests/day) × (average tokens/request) × 30
- Price that at your current or target API cost (e.g., $15/1M tokens for GPT-4o)
- Price the equivalent self-hosted compute: reserved GPU instance cost + ops overhead
- Crossover is where self-hosted monthly cost < API monthly cost
A concrete example: 10 million tokens/day = 300M tokens/month. At $15/1M tokens, that is $4,500/month on GPT-4o. A single A100 80GB reserved instance on AWS runs roughly $1,800–2,200/month and can handle that volume comfortably with vLLM and a well-sized model. The math favors self-hosting — but only if you have someone who can run it.
Decision Variable 2 — Latency Requirements
Not all use cases have the same latency tolerance. Batch processing jobs (nightly report generation, document classification pipelines) can tolerate 5–30 seconds per request. Interactive applications (chatbots, voice AI, copilots) need sub-1-second first-token latency.
- Groq: Fastest managed inference available — 500+ tokens/second on Llama 3 models. Best for latency-critical applications that can use open-weight models.
- vLLM on A100/H100: Sub-400ms TTFT for 70B models with continuous batching. Best self-hosted option for latency-sensitive workloads.
- Ollama: Simple to deploy, good for development and low-concurrency production. Not suitable for high-throughput applications.
- Cloud APIs: Latency is variable and outside your control. GPT-4o averages 400–800ms TTFT but can spike significantly under load.
Decision Variable 3 — Compliance Constraints
If your data is subject to HIPAA, GDPR data residency requirements, or SOC 2 controls, your infrastructure choices narrow significantly:
- HIPAA: Cloud APIs are viable only with a signed BAA — available from Azure OpenAI and AWS Bedrock, not from OpenAI's standard API. Air-gapped self-hosted eliminates the risk entirely.
- GDPR data residency: Requires inference to occur within EU borders. AWS Bedrock EU regions and self-hosted in EU data centers are the cleanest options.
- SOC 2: Cloud API providers with SOC 2 certification (most major ones) are generally acceptable. Document your vendor risk assessment.
- Air-gapped (highest security): Only self-hosted on your own hardware qualifies. No managed inference option provides true air-gapping.
GPU Selection Guide
If you are going self-hosted, GPU selection determines your throughput ceiling and cost floor. The options that matter in 2026:
- NVIDIA H100 80GB SXM: Best throughput for large models (70B+). NVLink for multi-GPU tensor parallelism. Expensive — justified at high volume.
- NVIDIA A100 80GB: Slightly lower throughput than H100 but significantly cheaper. The current sweet spot for most production deployments.
- NVIDIA A10G 24GB: Good for smaller models (7B–13B) and quantized 34B models. Available on AWS g5 instances. Cost-effective for medium-volume workloads.
- NVIDIA RTX 4090 24GB: Consumer card, surprisingly capable for self-hosted deployments with quantized models. Not available in cloud — for on-premises hardware only.
- AMD MI300X 192GB: Large memory footprint enables very large models without tensor parallelism. ROCm ecosystem is maturing; worth evaluating for new deployments.
Serving Framework Comparison
- vLLM: Best throughput for production via PagedAttention and continuous batching. OpenAI-compatible API. The default choice for serious production deployments.
- Text Generation Inference (TGI): Hugging Face's serving framework. Good model compatibility, slightly behind vLLM on raw throughput benchmarks.
- Ollama: Simplest setup, good for development and single-user production. No native batching — not suitable for concurrent user workloads.
- LiteLLM proxy: Not an inference engine, but a unified API gateway that routes to any backend. Excellent for teams that want to switch between providers without changing application code.
The Decision Matrix
- Under $3K/month API spend + no compliance constraints → Cloud API (OpenAI, Anthropic)
- Under $3K/month API spend + latency critical → Groq or managed inference on open-weight models
- Over $3K/month + team can run infra + no air-gap requirement → Self-hosted on cloud GPUs
- HIPAA or air-gap required → Self-hosted on-premises only
- Rapidly changing volume or uncertain trajectory → Start cloud API, instrument costs, migrate when crossover is clear
The most expensive mistake I see teams make is premature optimization: building self-hosted infrastructure before they have the volume to justify it. The second most expensive is the opposite: staying on cloud APIs at $20K/month of spend because migration feels complex. The decision framework above prevents both.