Cost Optimization · LLMs
March 10, 2026 · 9 min read

LLM Cost Optimization Beyond Self-Hosting: Batching, Caching, Model Routing, and Prompt Compression

Moneeb Abbas

AI Systems Architect

Switching to a self-hosted model is the most dramatic LLM cost reduction available — but it is also the most complex. Before you go down that path, or alongside it, there are API-level optimizations that can cut your costs by 30–70% with a fraction of the engineering effort. These techniques work whether you are paying OpenAI $20K a month or running your own vLLM cluster.

Know Where the Money Goes First

Every cost optimization project should start with a token audit. Build a dashboard that breaks down your LLM spend by: endpoint or feature, model, prompt tokens vs completion tokens, and user segment. In every audit I have run, 20% of use cases account for 80% of cost — and those 20% are the only ones worth optimizing.

Tip: Log prompt_tokens and completion_tokens from every LLM API response. They are included in the usage field at no extra cost. This data is the foundation of every cost optimization decision.
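
A minimal sketch of the audit step, assuming you already log one record per call with the feature name, model, and the two token counts. The per-million-token prices and the feature names are illustrative assumptions, not figures from this post; substitute your provider's current rates.

```python
from collections import defaultdict

# Assumed prices in USD per million tokens; replace with your provider's current rates.
PRICES = {
    "gpt-4o": {"prompt": 2.50, "completion": 10.00},
    "gpt-4o-mini": {"prompt": 0.15, "completion": 0.60},
}

def cost_by_feature(records, prices=PRICES):
    """Sum estimated spend per feature and return (feature, cost) pairs, highest first."""
    totals = defaultdict(float)
    for r in records:
        p = prices[r["model"]]
        totals[r["feature"]] += (
            r["prompt_tokens"] * p["prompt"] + r["completion_tokens"] * p["completion"]
        ) / 1_000_000
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical log records built from the usage field of each response.
usage_log = [
    {"feature": "support_chat", "model": "gpt-4o", "prompt_tokens": 2_500, "completion_tokens": 500},
    {"feature": "support_chat", "model": "gpt-4o", "prompt_tokens": 2_400, "completion_tokens": 450},
    {"feature": "tagging", "model": "gpt-4o-mini", "prompt_tokens": 800, "completion_tokens": 20},
]

for feature, cost in cost_by_feature(usage_log):
    print(f"{feature}: ${cost:.6f}")
```

Sorting by cost puts the few expensive features at the top, which is where the rest of the techniques in this post should be applied first.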

Technique 1 — Async Batching

Most LLM providers offer a batch API that processes requests asynchronously at 50% of the standard per-token price. The tradeoff: results are returned within 24 hours rather than immediately. For workloads that do not require real-time responses, this is the single highest-ROI optimization available.

  • OpenAI Batch API: 50% discount, 24-hour completion window. Accepts up to 50,000 requests per batch file.
  • Anthropic Message Batches: same economics. Process up to 100,000 requests (or 256 MB) per batch.
  • Suitable use cases: nightly document classification, bulk content generation, offline evaluation runs, report generation, embedding generation for large document corpora.
  • Not suitable: any user-facing feature with a latency requirement, real-time agents, streaming interfaces.
python
from openai import OpenAI
import json

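client = OpenAI()

# Placeholder input: replace with the document texts you want to classify
documents = ["First document text", "Second document text"]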

# Build batch request file
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify this document into one category."},
                {"role": "user", "content": doc_text},
            ],
            "max_tokens": 50,
        },
    }
    for i, doc_text in enumerate(documents)
]

# Upload and submit batch: 50% cheaper, results within 24h
# The SDK expects file contents as bytes, so encode the JSONL payload
batch_file = client.files.create(
    file=("batch.jsonl", "\n".join(json.dumps(r) for r in requests).encode("utf-8")),
    purpose="batch",
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
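
Once the batch completes, the results arrive as a JSONL file whose lines are not guaranteed to follow input order, so they have to be matched back by custom_id. The sketch below parses such a file. The field layout (custom_id, response.status_code, response.body, error) reflects my understanding of the Batch API output format and should be checked against the current documentation; the sample lines are made up for illustration.

```python
import json

def parse_batch_output(jsonl_text: str) -> tuple[dict[str, str], dict[str, object]]:
    """Split batch output lines into answers and failures, both keyed by custom_id."""
    answers, failures = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        response = row.get("response") or {}
        if row.get("error") or response.get("status_code") != 200:
            # Keep whatever diagnostic the line carries so it can be retried later.
            failures[row["custom_id"]] = row.get("error") or response
            continue
        answers[row["custom_id"]] = response["body"]["choices"][0]["message"]["content"]
    return answers, failures

# Made-up sample output: one success and one failure, deliberately out of order.
sample = "\n".join([
    json.dumps({
        "custom_id": "doc-1",
        "response": {"status_code": 200, "body": {"choices": [{"message": {"content": "invoice"}}]}},
        "error": None,
    }),
    json.dumps({
        "custom_id": "doc-0",
        "response": None,
        "error": {"code": "server_error", "message": "failed"},
    }),
])
answers, failures = parse_batch_output(sample)
print(answers, failures)
```

In a real run the text would come from the batch's output file, for example by polling client.batches.retrieve(batch.id) until the status is completed and then reading client.files.content(batch.output_file_id).text.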

Technique 2 — Semantic Caching

Exact-match response caching (caching the response when the prompt is byte-identical) is easy to implement but has a low hit rate. Semantic caching stores responses and retrieves them when a new query is semantically similar to a previously answered one, even if the wording differs.

The implementation: embed each incoming query, search a cache index for similar past queries, and return the cached response if similarity exceeds a threshold. A threshold of 0.95 cosine similarity gives high precision with meaningful hit rates for FAQ-style applications.

python
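import hashlib

import redis

# NOTE: embed() and vector_store are placeholders for your embedding function
# and vector database client; neither is defined in this snippet.

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        # decode_responses=True makes Redis return str instead of bytes
        self.cache = redis.Redis(decode_responses=True)

    def get(self, query: str) -> str | None:
        query_vec = embed(query)
        # Search cache index for similar past queries
        results = vector_store.search("query_cache", query_vec, top_k=1)
        if results and results[0].score >= self.threshold:
            cached_key = results[0].payload["cache_key"]
            # Returns None if the Redis entry has already expired
            return self.cache.get(cached_key)
        return None

    def set(self, query: str, response: str, ttl: int = 3600):
        query_vec = embed(query)
        # Built-in hash() is salted per process, so use a stable digest for keys
        cache_key = f"cache:{hashlib.sha256(query.encode('utf-8')).hexdigest()}"
        self.cache.setex(cache_key, ttl, response)
        vector_store.upsert("query_cache", query_vec, {"cache_key": cache_key})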

In support chatbot and FAQ use cases, semantic caching typically achieves 30–50% hit rates on production traffic. At a 40% hit rate, you are paying for 60% of the LLM calls you would otherwise make.
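
To experiment with the threshold logic without Redis or a vector database, here is a self-contained toy version. The toy_embed function is a stand-in I made up for illustration: it builds a bag-of-words vector, which only catches near-identical wording. A real deployment would use an embedding model so that paraphrases also score highly.

```python
import math
import re
from collections import Counter
from typing import Optional

def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts. Swap in a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InMemorySemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        vec = toy_embed(query)
        # Linear scan for the most similar stored query; a vector index replaces this at scale.
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def set(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))

cache = InMemorySemanticCache()
cache.set("What are your business hours?", "We are open 9am to 5pm, Monday to Friday.")
hit = cache.get("what are your business hours")
miss = cache.get("How do I reset my password?")
print(hit, miss)
```

Lowering the threshold raises the hit rate but also the chance of returning an answer to a different question, so it is worth tuning against a labeled sample of real queries.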

Technique 3 — Model Routing

Not every query needs your most capable and expensive model. A customer asking 'what are your business hours' does not need GPT-4o. A customer asking the system to draft a legal clause does. Model routing classifies each request and sends it to the cheapest model capable of handling it well.

  • Complexity classification: train a lightweight classifier (or use a fast, cheap LLM) to categorize each query as simple, medium, or complex.
  • Simple queries → GPT-4o-mini or Claude Haiku: 10–20x cheaper than frontier models, sufficient for factual lookups, simple Q&A, and classification tasks.
  • Complex queries → GPT-4o or Claude Sonnet: multi-step reasoning, nuanced analysis, long-form generation.
  • Routing overhead: the classification call costs tokens too. Use a rule-based heuristic (query length, keyword detection) first; escalate to LLM-based classification only for ambiguous cases.
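
The rules-first approach from the last bullet could look roughly like this. The keyword list, the word-count cutoffs, and the choice to send unresolved queries to the larger model are all assumptions made for illustration; they would need tuning on real traffic.

```python
import re

# Hypothetical hints that a query needs the stronger model; tune on your own data.
COMPLEX_HINTS = re.compile(
    r"\b(draft|analy[sz]e|compare|contract|clause|explain why|step by step)\b", re.I
)

def classify(query: str) -> str:
    """Return 'simple', 'complex', or 'ambiguous' using cheap rules only."""
    words = len(query.split())
    if COMPLEX_HINTS.search(query) or words > 60:
        return "complex"
    if words <= 12:
        return "simple"
    return "ambiguous"

def route(query: str, llm_classifier=None) -> str:
    """Pick a model name; only ambiguous queries reach the optional LLM-based classifier."""
    label = classify(query)
    if label == "ambiguous":
        # With no second-stage classifier supplied, fall back to the stronger model.
        label = llm_classifier(query) if llm_classifier else "complex"
    return "gpt-4o-mini" if label == "simple" else "gpt-4o"

print(route("What are your business hours?"))
print(route("Please draft a legal clause covering late payment penalties"))
```

Defaulting ambiguous queries to the larger model trades some savings for safety: a misrouted complex query costs a bad answer, while a misrouted simple query only costs a few extra cents.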

In a typical mixed-complexity SaaS application, 60–70% of queries are simple enough for a mini model. If your routing accuracy is 90%, the cost reduction is still 50–60% on those queries.
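
The arithmetic behind an estimate like this can be sketched as follows. The model assumes every query uses the same number of tokens and that the small model costs one fifteenth of the large one; both are simplifying assumptions of mine, and the function ignores the quality cost of complex queries that get misrouted to the small model.

```python
def routed_cost_ratio(simple_share: float, accuracy: float, price_ratio: float) -> float:
    """Cost after routing, as a fraction of sending everything to the large model.

    simple_share: fraction of traffic that is genuinely simple
    accuracy:     fraction of simple queries the router sends to the small model
    price_ratio:  small-model price divided by large-model price
    """
    routed = simple_share * accuracy  # traffic that actually lands on the small model
    return routed * price_ratio + (1 - routed)

ratio = routed_cost_ratio(simple_share=0.65, accuracy=0.90, price_ratio=1 / 15)
print(f"cost vs. baseline: {ratio:.1%}, saving: {1 - ratio:.1%}")
```

Under these assumptions the saving on total spend comes out at roughly 55%, and it degrades gracefully as routing accuracy drops because missed simple queries just fall back to the large model.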

Technique 4 — Prompt Compression

Prompt tokens are often the majority of total token cost in RAG systems — especially when retrieved context is long. Prompt compression reduces the token count of context before sending it to the model, while preserving the information needed to answer the query.

  • LLMLingua: Microsoft's open-source prompt compression library. Uses a small LM to score token importance and remove low-importance tokens from context. Achieves 3–20x compression with acceptable quality degradation.
  • Selective chunk inclusion: instead of sending all retrieved chunks, send only the top-2 most relevant (rather than top-5). The compression ratio is lower, but the chunks you keep are sent unmodified; the remaining risk is dropping a lower-ranked chunk that contained the answer.
  • Summary compression: replace retrieved chunks with LLM-generated summaries focused on the query. A 500-token chunk summarized to 100 tokens loses some detail but saves 400 tokens per chunk.
  • Structured data compaction: tables and lists embedded in prompts can often be reformatted more compactly without losing information. A JSON object with verbose keys can be minified.
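
Two of the cheaper options above, structured data compaction and selective chunk inclusion, need nothing beyond the standard library. This is a minimal sketch: the record, the chunk texts, and the relevance scores are invented, and the scores are assumed to come from your retriever or reranker.

```python
import json

def compact_json(obj) -> str:
    """Serialize without the default spaces after ',' and ':'."""
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)

def select_chunks(scored_chunks, top_k: int = 2, min_score: float = 0.0) -> list[str]:
    """Keep the top_k highest-scoring chunks that are at or above min_score."""
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)
    return [text for score, text in ranked[:top_k] if score >= min_score]

record = {"customer_name": "Ada", "open_tickets": [101, 102], "plan": "pro"}
pretty = json.dumps(record, indent=2)
compact = compact_json(record)
print(len(pretty), len(compact))

# (relevance score, chunk text) pairs, as a retriever might return them.
chunks = [
    (0.91, "Refunds are issued within 14 days."),
    (0.42, "Shipping takes 3 to 5 business days."),
    (0.88, "Refunds are not available for gift cards."),
]
print(select_chunks(chunks))
```

Character counts are only a proxy here; measure the actual saving with your model's tokenizer before relying on it.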

Technique 5 — Prompt Caching (Provider-Level)

Both Anthropic and OpenAI offer provider-level prompt caching for repeated prefixes. If your system prompt is 2,000 tokens and you send 100,000 requests per day, you are paying for 200 million prompt tokens daily for content that never changes. With prompt caching, you pay full price once per cache window and a fraction for all subsequent requests.

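  • Anthropic: mark the system prompt with cache_control. Cache reads are billed at 10% of the normal input token price and cache writes at a 25% premium, with a 5-minute TTL that refreshes on every hit. After the first call, the cached prefix costs a tenth of the normal rate for as long as traffic keeps the cache warm.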
  • OpenAI: automatic for prompts of 1,024 tokens or more. 50% discount on cached prefix tokens while the prefix stays in cache, which typically means until a few minutes of inactivity have passed.
  • Structure your prompts to maximize the cached prefix: put the static system prompt first, followed by dynamic content (retrieved context, user message). The longer the static prefix, the larger the cache benefit.
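
As a sketch of that ordering for Anthropic's Messages API, the helper below only builds the request arguments, so it runs without an API key. The function name, the prompt layout, and the max_tokens value are my own choices for illustration; the cache_control marker on a system text block is the part that comes from the API.

```python
def build_cached_request(model: str, static_system_prompt: str,
                         retrieved_context: str, user_message: str) -> dict:
    """Build keyword arguments for a Messages call with a cacheable system prefix."""
    return {
        "model": model,
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": static_system_prompt,
                # The prompt prefix up to and including this block is eligible for caching.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content goes after the static prefix so it never invalidates the cache.
        "messages": [
            {
                "role": "user",
                "content": f"Context:\n{retrieved_context}\n\nQuestion: {user_message}",
            }
        ],
    }

kwargs = build_cached_request(
    model="your-model-name",  # placeholder, not a real model identifier
    static_system_prompt="You are a support assistant for an example product.",
    retrieved_context="Refunds are issued within 14 days.",
    user_message="How long do refunds take?",
)
print(kwargs["system"][0]["cache_control"])
```

The resulting dictionary can be passed to the SDK as client.messages.create(**kwargs). The usage object on the response reports cache_creation_input_tokens and cache_read_input_tokens, which is the easiest way to confirm the prefix is actually being reused.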

Combining Techniques: A Real Example

On a production RAG-based support application with 500,000 queries per month at an average of 3,000 tokens per request (2,500 prompt, 500 completion), the baseline cost at GPT-4o pricing was approximately $5,250/month. After applying all techniques:

  • Semantic caching (35% hit rate): effectively 325,000 billable requests → saves ~35%
  • Model routing (65% of queries to GPT-4o-mini): remaining queries split by complexity → saves ~55% on routed queries
  • Prompt caching (1,800-token static system prompt): cached prefix on all requests → saves ~60% on prompt tokens for non-cached queries
  • Combined effective cost reduction: approximately 78% vs baseline
  • Final monthly cost: ~$1,150 vs $5,250 — no self-hosting, no infrastructure changes

The optimizations compound multiplicatively. Semantic caching reduces the number of LLM calls; model routing reduces the cost per call on the remaining ones; prompt caching reduces the token cost per call further. Each technique is independent and can be implemented incrementally.
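
To make the compounding concrete, here is a small worked calculation. The list prices ($2.50 and $10.00 per million tokens) are my assumption rather than the billed rates behind the case study; they give a baseline of about $5,625, in the same range as the figure quoted above. The three savings fractions are illustrative, not measurements from that application.

```python
import math

def monthly_cost(requests, prompt_tokens, completion_tokens, prompt_price, completion_price):
    """Monthly spend in USD; prices are per million tokens."""
    return requests * (prompt_tokens * prompt_price + completion_tokens * completion_price) / 1_000_000

def combined_reduction(savings):
    """Overall reduction when each saving applies to the cost left by the previous one."""
    return 1 - math.prod(1 - s for s in savings)

# Assumed list prices, applied to the traffic profile from the example.
baseline = monthly_cost(500_000, 2_500, 500, prompt_price=2.50, completion_price=10.00)

# Illustrative whole-bill savings for three stacked techniques.
overall = combined_reduction([0.35, 0.35, 0.30])

print(f"baseline: ${baseline:,.0f}/month")
print(f"combined reduction: {overall:.1%}, remaining: ${baseline * (1 - overall):,.0f}/month")
```

Adding the three fractions would suggest a 100% saving, which is impossible; multiplying the remaining shares gives about 70%, because each technique only acts on the spend the previous ones left behind.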
