RAG · Embeddings
May 12, 2026 · 9 min read

Choosing and Optimizing Embedding Models for Production RAG: Beyond the Default

Moneeb Abbas

AI Systems Architect

Most teams ship RAG systems with text-embedding-3-small or text-embedding-ada-002 because that is what the tutorial used. For general-purpose text in English, this is fine. For domain-specific content, multilingual documents, or applications where retrieval quality directly affects revenue, the embedding model choice is one of the highest-leverage decisions in the stack.

What Embedding Models Actually Do

An embedding model maps text to a dense vector in a high-dimensional space where semantically similar texts are geometrically close. The quality of that mapping determines retrieval quality: a model that maps 'termination clause' and 'contract termination provision' close together will retrieve the right passage; one that does not will miss it.
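
To make "geometrically close" concrete, here is a minimal sketch of cosine similarity, the measure most vector stores use to compare embeddings. The three-dimensional vectors are hand-written toy values chosen for illustration, not output from any real model, and the names `cosine_similarity` and `toy_embeddings` are made up for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors, NOT real embeddings. Real models emit
# hundreds or thousands of dimensions.
toy_embeddings = {
    "termination clause": [0.9, 0.1, 0.2],
    "contract termination provision": [0.8, 0.2, 0.3],
    "quarterly revenue forecast": [0.1, 0.9, 0.4],
}

query_text = "termination clause"
query_vec = toy_embeddings[query_text]

# Score every other text against the query and pick the closest one.
scores = {
    text: cosine_similarity(query_vec, vec)
    for text, vec in toy_embeddings.items()
    if text != query_text
}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 3))
```

With these toy values the paraphrase scores far higher than the unrelated text. A good embedding model produces that geometry on its own; a poor one for your domain does not.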

The key insight: embedding quality is task-specific. A model trained on general web text excels at general semantic similarity. A model trained on legal documents understands that 'indemnification' and 'hold harmless' are synonymous. If your corpus and queries have domain-specific vocabulary, general-purpose embeddings will underperform.

The Embedding Model Landscape

  • OpenAI text-embedding-3-small / large: Strong general-purpose baseline, easy API integration, 1536 and 3072 dimensions respectively. The large model is competitive with open-source alternatives. Managed, no infrastructure required.
  • BGE-M3 (BAAI): Strong open-source model. Supports dense, sparse, and multi-vector retrieval in one model. Multilingual (100+ languages). Runs locally, which is essential when data residency requirements rule out a managed API.
  • E5-mistral-7b-instruct: Instruction-tuned embedding model. Scored at or near the top of MTEB when it was released, especially for asymmetric retrieval (short query, long document). It is a 7B-parameter model, so compute cost is much higher than for BERT-sized models.
  • Cohere Embed v3: Strong multilingual support, native input type distinction (query vs document), integrated reranking ecosystem. Managed API.
  • Domain fine-tuned models: Models fine-tuned on domain-specific data consistently outperform general models in that domain. Worth considering when retrieval quality directly affects business outcomes.
Note: MTEB (Massive Text Embedding Benchmark) is the standard leaderboard for comparing embedding models. Check your use case category (retrieval, classification, clustering) rather than the overall score. A model ranked 5th overall may rank 1st for your specific task.

Evaluating Retrieval Quality

The only way to know which embedding model is right for your use case is to measure retrieval quality on your actual data. The evaluation process:

  1. Build an evaluation set: 50–200 query/relevant-document pairs. The query is what a real user would ask; the relevant document is the correct passage to retrieve. Use production queries if available; write synthetic ones otherwise.
  2. Measure recall@k: for each query, retrieve the top-k chunks and check whether the relevant document appears. Recall@5 (is the right answer in the top 5?) is the most useful metric for RAG.
  3. Measure MRR (Mean Reciprocal Rank): the average of 1/rank for the first relevant result. A correct answer at position 1 scores 1.0 and one at position 5 scores only 0.2, so MRR penalizes models that bury the right passage.
  4. Compare models: run every candidate model through the same evaluation set and compare recall@5 and MRR. The difference between models on your domain data is often surprising.
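```python
def evaluate_retrieval(
    eval_set: list[dict],   # [{"query": str, "relevant_doc_id": str}]
    retriever,              # any object with search(query, top_k) returning results that have an .id
    k: int = 5,
) -> dict:
    """Compute recall@k and MRR (truncated at k) for one relevant document per query."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one query")

    recall_hits = 0
    reciprocal_ranks = []

    for item in eval_set:
        results = retriever.search(item["query"], top_k=k)
        result_ids = [r.id for r in results]

        # Recall@k: did the relevant document make it into the top k?
        if item["relevant_doc_id"] in result_ids:
            recall_hits += 1

        # MRR: 1/rank of the relevant document, or 0 if it was not retrieved
        try:
            rank = result_ids.index(item["relevant_doc_id"]) + 1
            reciprocal_ranks.append(1.0 / rank)
        except ValueError:
            reciprocal_ranks.append(0.0)

    return {
        f"recall@{k}": recall_hits / len(eval_set),
        "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks),
    }
```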
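
The comparison step can be sketched the same way. The example below is self-contained and works on precomputed rankings rather than a live retriever; the model names, query ids, and document ids are all hypothetical, and `recall_and_mrr` is a helper written for this illustration. It uses k=3 to keep the toy data small.

```python
def recall_and_mrr(
    rankings: dict[str, list[str]],   # query id -> ranked doc ids from one model
    relevant: dict[str, str],         # query id -> the single correct doc id
    k: int = 5,
) -> dict[str, float]:
    """Recall@k and MRR (truncated at k) over precomputed rankings."""
    hits = 0
    reciprocal_ranks = []
    for query_id, ranked_ids in rankings.items():
        top_k = ranked_ids[:k]
        target = relevant[query_id]
        if target in top_k:
            hits += 1
            reciprocal_ranks.append(1.0 / (top_k.index(target) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(rankings)
    return {"recall": hits / n, "mrr": sum(reciprocal_ranks) / n}


# Hypothetical ground truth and rankings from two candidate models.
relevant = {"q1": "d1", "q2": "d2", "q3": "d3"}
candidate_rankings = {
    "model_a": {"q1": ["d1", "d9", "d8"], "q2": ["d7", "d2", "d6"], "q3": ["d5", "d4", "d3"]},
    "model_b": {"q1": ["d1", "d9", "d8"], "q2": ["d2", "d7", "d6"], "q3": ["d9", "d8", "d7"]},
}

results = {
    name: recall_and_mrr(rankings, relevant, k=3)
    for name, rankings in candidate_rankings.items()
}
for name, metrics in results.items():
    print(f"{name}: recall@3={metrics['recall']:.2f}  mrr={metrics['mrr']:.2f}")
```

In this toy data the two metrics disagree: model_a finds every relevant document but ranks some of them low, while model_b ranks its hits first but misses one query entirely. Which trade-off matters more depends on how many chunks you pass to the LLM.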

Dimensionality: More Is Not Always Better

Higher-dimensional embeddings can capture more nuance but have real costs: more storage, slower similarity search at scale, and higher memory usage in your vector database. The tradeoffs:

  • 768 dimensions: typical for mid-size open-source models (BGE-base, E5-base). Good quality, efficient storage. Right for most production RAG systems.
  • 1536 dimensions: OpenAI text-embedding-3-small and text-embedding-ada-002. Higher quality ceiling, roughly 2x storage vs 768. (BERT-large derived models such as BGE-large and E5-large sit in between at 1024 dimensions.)
  • 3072 dimensions: OpenAI text-embedding-3-large, some frontier models. Best quality for general text, but 4x storage cost vs 768. Diminishing returns for most domain-specific use cases.
  • Matryoshka Representation Learning (MRL): models like text-embedding-3 support truncating to lower dimensions with graceful quality degradation. You can use 256 dimensions for coarse retrieval and 1536 for re-ranking without two separate models.
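
The storage ratios above follow from simple arithmetic, and the truncation trick is a slice followed by re-normalization. The sketch below assumes float32 storage (4 bytes per dimension) and ignores index overhead, which varies by vector database. The function names are made up for illustration, and truncation only preserves quality for models trained with MRL.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as uncompressed float32


def raw_vector_bytes(num_vectors: int, dims: int) -> int:
    """Raw storage for the vectors alone, excluding index structures and metadata."""
    return num_vectors * dims * BYTES_PER_FLOAT32


def truncate_and_normalize(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit length.

    Only meaningful for MRL-trained models; for other models the leading
    components carry no special weight.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


# Storage for one million chunks at each dimensionality discussed above.
for dims in (256, 768, 1536, 3072):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>4} dims: {gib:.2f} GiB")

# Toy 4-dimensional vector standing in for a full embedding.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_and_normalize(full, 2)
print(short)
```

Re-normalizing matters because cosine and dot-product scores assume unit-length vectors, and a truncated vector is no longer unit length. The OpenAI embeddings API also exposes a `dimensions` parameter on the text-embedding-3 models that returns the shortened vector directly.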

When to Fine-Tune Embeddings

Fine-tuning an embedding model on domain-specific data typically delivers a 5–20% improvement in retrieval metrics such as recall@k on that domain. The signal to pursue it: you have run the evaluation above, a general-purpose model is underperforming on your eval set, and the domain has genuinely specialized vocabulary that general training data does not cover well.

The training data you need: query/relevant-document pairs, ideally 1,000–10,000 examples. You can generate synthetic training pairs by prompting an LLM to write questions that would be answered by each document in your corpus — a technique called synthetic query generation.
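
A minimal sketch of synthetic query generation follows. It deliberately takes the LLM as a plain callable so it does not assume any particular provider: `generate`, `PROMPT_TEMPLATE`, `build_training_pairs`, and the `fake_llm` stub are all hypothetical names for this illustration, and the prompt wording is only a starting point.

```python
from typing import Callable

# Hypothetical prompt; tune the wording for your domain.
PROMPT_TEMPLATE = (
    "Write {n} questions that a user might ask and that the passage below answers. "
    "Return one question per line.\n\nPassage:\n{passage}"
)


def build_training_pairs(
    corpus: dict[str, str],              # doc id -> passage text
    generate: Callable[[str], str],      # any function mapping a prompt to LLM output
    questions_per_doc: int = 3,
) -> list[dict]:
    """Turn each passage into (query, positive_doc_id) training pairs."""
    pairs = []
    for doc_id, passage in corpus.items():
        prompt = PROMPT_TEMPLATE.format(n=questions_per_doc, passage=passage)
        raw_output = generate(prompt)
        # One question per non-empty line; cap at the requested count.
        questions = [line.strip() for line in raw_output.splitlines() if line.strip()]
        for question in questions[:questions_per_doc]:
            pairs.append({"query": question, "positive_doc_id": doc_id})
    return pairs


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the example runs offline."""
    return "What is the notice period for termination?\n\nWho may terminate the agreement?\n"


corpus = {"clause-12": "Either party may terminate this agreement with 30 days written notice."}
pairs = build_training_pairs(corpus, fake_llm, questions_per_doc=2)
print(pairs)
```

In practice it is worth spot-checking a sample of generated questions by hand, since an LLM can produce questions the passage does not actually answer.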

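  • Use the sentence-transformers library for fine-tuning: it is well-documented, efficient, and supports most popular open-source base models.
  • Fine-tune with MultipleNegativesRankingLoss on (query, positive_doc) pairs, which treats the other documents in each batch as negatives, or with TripletLoss on explicit (query, positive_doc, negative_doc) triples.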
  • Start from BGE-M3 or E5-base, not from scratch — the general semantic understanding transfers.
  • Evaluate on a held-out eval set before and after fine-tuning to confirm improvement.
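
To show what an in-batch negatives ranking loss actually optimizes, here is a pure-Python sketch of the idea. It is an illustration of the technique, not the sentence-transformers implementation; the default `scale` of 20.0 is an assumption modeled on that library's default, and the two-dimensional vectors are toy values.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_negatives_loss(
    query_vecs: list[list[float]],
    doc_vecs: list[list[float]],
    scale: float = 20.0,   # assumed temperature-like multiplier on the similarities
) -> float:
    """Mean cross-entropy where doc_vecs[i] is the positive for query_vecs[i]
    and every other document in the batch serves as a negative."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each query points the same way as its own document.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = in_batch_negatives_loss(queries, aligned_docs)
swapped_loss = in_batch_negatives_loss(queries, swapped_docs)
print(aligned_loss, swapped_loss)
```

The loss is near zero when each query is closest to its own document and large when it is closer to another document in the batch. This is why larger batches, which supply more negatives per query, tend to help with this kind of loss.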

Asymmetric Retrieval: Query vs Document Embeddings

In most RAG systems, queries are short (one sentence) and documents are long (multiple paragraphs). Some embedding models handle this asymmetry explicitly by using different representations for queries and documents. The English BGE v1.5 models prepend an instruction ('Represent this sentence for searching relevant passages:') to queries only and embed documents without a prefix, while BGE-M3 needs no instruction at all. E5 models use 'query:' and 'passage:' prefixes. Using the wrong prefix, or omitting a required one, can silently degrade retrieval quality by 5–15%.
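
One way to avoid prefix mistakes is to route every text through a single helper before it reaches the model. The sketch below covers the E5 convention only; `add_e5_prefix` is a hypothetical helper name, and the exact prefix strings for any other model should be taken from its model card.

```python
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}


def add_e5_prefix(text: str, kind: str) -> str:
    """Prepend the E5-style prefix for a query or a passage.

    Raises ValueError on an unknown kind so a typo fails loudly
    instead of silently embedding unprefixed text.
    """
    if kind not in E5_PREFIXES:
        raise ValueError(f"kind must be one of {sorted(E5_PREFIXES)}, got {kind!r}")
    return E5_PREFIXES[kind] + text.strip()


print(add_e5_prefix("how do I cancel my contract?", "query"))
print(add_e5_prefix("Either party may terminate with 30 days notice.", "passage"))
```

Calling the helper at both index time (with "passage") and query time (with "query") keeps the two sides consistent, which is the part that is easy to get wrong when indexing and serving live in different codebases.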

Warning: Always read the model card for the embedding model you are using. Many models require specific prefixes for queries vs documents, and omitting them is a silent quality degradation that does not surface as an error.
