RAG · Document Intelligence
March 20, 2026 · 11 min read

Building a Production RAG Pipeline: Chunking, Hybrid Search, and Re-Ranking That Actually Works

Moneeb Abbas

AI Systems Architect

Most RAG demos work great on the happy path. A clean question, a clean document, a clean match. Production is not the happy path. This post covers what breaks in real RAG deployments — and the specific techniques that fixed them in a production legal document system handling 40,000+ contracts.

The Three Failure Modes of Naive RAG

Naive RAG — chunk a document into fixed-size pieces, embed them, store in a vector database, retrieve by cosine similarity — works in demos because demos are designed to succeed. In production, three failure modes appear immediately:

  1. Retrieval misses the answer even though the answer is in the corpus. The relevant passage did not rank in the top-k because the query semantics diverged from the chunk embedding.
  2. Retrieval returns the right document but the wrong chunk. The answer is split across a chunk boundary, so neither chunk is independently useful.
  3. The LLM hallucinates a citation. It produces a confident answer that paraphrases content from the wrong document or invents a clause that does not exist.

Each failure mode requires a different fix. Trying to solve all three by just improving your embeddings — the most common advice — will not work. Embeddings are only one part of the retrieval chain.

Chunking Strategy: The Most Underrated Lever

The default chunking advice — 512 tokens, 50 token overlap — is a reasonable starting point for generic text. For structured documents like contracts, medical records, or financial reports, it is almost always wrong.

Legal contracts have a natural structure: parties, recitals, definitions, clauses, schedules. A fixed-size chunker splits clauses mid-sentence and puts unrelated definitions in the same chunk. For the legal RAG system I built, we replaced fixed-size chunking with a structure-aware chunker that:

  • Identified section boundaries using heading patterns and indentation
  • Split at clause level, not token count — a clause stays in one chunk even if it is 800 tokens
  • Added a metadata header to each chunk: document name, section number, clause type, date range
  • Created overlapping 'context chunks' — short summaries of adjacent sections prepended to each chunk

Tip: If you cannot describe why a document should be chunked the way you are chunking it, you are probably using a generic strategy. Domain-specific chunking is worth the engineering time.
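
The chunking steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production chunker: the heading regex, the `Chunk` dataclass, and `chunk_contract` are hypothetical names, and the code assumes headings look like `7.3 Liquidated Damages` at the start of a line. The overlapping 'context chunks' are left out to keep the sketch short.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed heading pattern: a dotted section number followed by a capitalized title,
# e.g. "7.3 Liquidated Damages". Real contracts will need a richer pattern.
HEADING_RE = re.compile(r"^(?P<number>\d+(?:\.\d+)*)\.?\s+(?P<title>[A-Z].*)$")


@dataclass
class Chunk:
    document_name: str
    section_number: str
    title: str
    text: str

    def to_embedding_text(self) -> str:
        # Prepend a metadata header so the embedding sees where the clause lives.
        header = f"[{self.document_name} | Section {self.section_number} | {self.title}]"
        return f"{header}\n{self.text}"


def chunk_contract(document_name: str, raw_text: str) -> list[Chunk]:
    """Split at section headings, keeping each clause whole regardless of length."""
    chunks: list[Chunk] = []
    current: Optional[Chunk] = None
    for line in raw_text.splitlines():
        stripped = line.strip()
        match = HEADING_RE.match(stripped)
        if match:
            # A new heading closes the previous clause and opens the next one.
            if current is not None:
                chunks.append(current)
            current = Chunk(document_name, match["number"], match["title"], "")
        elif current is not None and stripped:
            current.text = f"{current.text} {stripped}".strip()
    if current is not None:
        chunks.append(current)
    return chunks


sample = """1 Definitions
"Services" means the services described in Schedule A.
7.3 Liquidated Damages
If the Supplier fails to deliver on time,
it shall pay 0.5% of the fees per week of delay."""

chunks = chunk_contract("msa_acme.pdf", sample)
```

Note that nothing in this sketch counts tokens: a long clause simply becomes a long chunk, which is the trade-off described above.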

Hybrid Search: Dense + Sparse

Dense retrieval (embedding similarity via models like text-embedding-3-large or BGE) captures semantic meaning well but struggles with exact keyword matches. If a lawyer asks about 'Section 7.3 liquidated damages clause', a dense search might return semantically similar content from other sections instead of the exact clause.

Sparse retrieval (BM25 or BM42) is excellent for exact keyword and phrase matching but misses paraphrasing. Hybrid search combines both:

python
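from qdrant_client import QdrantClient
from qdrant_client.models import NamedSparseVector, SparseVector

# Placeholder connection; point this at your own Qdrant instance.
client = QdrantClient(url="http://localhost:6333")

def hybrid_search(query: str, top_k: int = 20) -> list[dict]:
    # embed(), bm25_encode() and reciprocal_rank_fusion() are application helpers
    # defined elsewhere. bm25_encode() is assumed to return
    # {"indices": [...], "values": [...]}.

    # Dense pass: assumes the dense vector is the collection's default (unnamed) vector.
    dense_results = client.search(
        collection_name="contracts",
        query_vector=embed(query),
        limit=top_k,
        with_payload=True,
    )

    # Sparse pass: Qdrant sparse vectors are always named, so the query is wrapped
    # in NamedSparseVector. The name "sparse" is an assumption and must match the
    # collection config.
    sparse_results = client.search(
        collection_name="contracts",
        query_vector=NamedSparseVector(
            name="sparse",
            vector=SparseVector(**bm25_encode(query)),
        ),
        limit=top_k,
        with_payload=True,
    )

    return reciprocal_rank_fusion(dense_results, sparse_results)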

Reciprocal Rank Fusion (RRF) is a simple and effective way to merge results from two ranked lists without needing calibrated scores. The formula is straightforward: for each document, sum 1/(rank + k) across all lists it appears in, where rank is the document's 1-based position in that list and k is a smoothing constant (typically 60). Documents are then sorted by this summed score, highest first.
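
To make the formula concrete, here is a minimal sketch of what such a fusion helper could look like. It is an illustration rather than the helper used in the production system: it works on plain document IDs, so adapting it to Qdrant result objects would mean extracting each point's ID first. The sample IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; ranks are 1-based, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (rank + k)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["clause_7_3", "clause_2_1", "clause_9_4"]
sparse = ["clause_7_3", "clause_9_4", "clause_5_2"]
fused = reciprocal_rank_fusion(dense, sparse)
```

In this example `clause_7_3` ranks first in both lists and scores 2/61. `clause_9_4` appears in both lists at lower ranks and still beats `clause_2_1`, which appears in only one list. Only ranks enter the calculation, so the dense and sparse scores never need to be on the same scale.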

Cross-Encoder Re-Ranking

After hybrid retrieval, you typically have 20–40 candidates. The retrieval step optimizes for recall — getting the right chunk somewhere in the results. Re-ranking optimizes for precision — putting the best chunk at the top.

Bi-encoder models (used for the dense embeddings) encode the query and document independently, making them fast enough for large-scale retrieval. Cross-encoders process the query and document together, giving them much better judgment about relevance — at the cost of being too slow to run over the full corpus.

The two-stage approach: retrieve 20–40 candidates cheaply, re-rank with a cross-encoder, send the top 5 to the LLM. For the legal system, we used Cohere's Rerank API. You can also run an open-source cross-encoder like BGE-reranker-v2-m3 on your own infrastructure if you need data residency.
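
The second stage can be sketched as a small function that scores every (query, candidate) pair and keeps the best few. Everything here is illustrative: `rerank` and `toy_overlap_scorer` are hypothetical names, and the toy scorer only counts shared words so the example runs without a model. In a real system the `score_pairs` argument would wrap a cross-encoder (for example the `CrossEncoder.predict` method in sentence-transformers, which takes a list of text pairs) or a hosted re-rank API.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the top_n."""
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(text, float(score)) for text, score in ranked[:top_n]]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    # Stand-in for a real cross-encoder: fraction of query words found in the text.
    scores = []
    for query, text in pairs:
        query_words = set(query.lower().split())
        text_words = set(text.lower().split())
        scores.append(len(query_words & text_words) / max(len(query_words), 1))
    return scores


candidates = [
    "Termination for convenience requires ninety days notice.",
    "Liquidated damages accrue at 0.5% of fees per week of delay.",
    "Governing law is the law of England and Wales.",
]
top = rerank("liquidated damages for delay", candidates, toy_overlap_scorer, top_n=2)
```

Passing the scorer in as a callable keeps the retrieval code unchanged when you swap a hosted API for a self-hosted model.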

Note: In the legal contract system, adding re-ranking improved answer accuracy on a test set of 200 expert-labeled queries from 71% to 89%. It is one of the highest-ROI improvements you can make to a RAG pipeline.

Citation: Every Answer Must Prove Itself

In legal and compliance contexts, a correct answer without a source is useless. The LLM must cite the specific clause it is drawing from, and the application must verify that citation is real.

The implementation has two parts. First, the prompt requires structured citation output:

python
SYSTEM_PROMPT = """
You are a legal document analyst. Answer based ONLY on the provided context.

For each claim in your answer, you MUST include a citation in this format:
[Contract: {document_name}, Section {section}, Clause {clause_number}]

If the answer is not in the provided context, say: "I cannot find this in the provided documents."
Do not speculate or use general legal knowledge.
"""

Second, after the LLM response is generated, a post-processing step validates each citation: it retrieves the referenced chunk by document name, section, and clause number, and checks that the LLM's paraphrase is semantically consistent with the source. If the citation cannot be verified, the answer is flagged for human review.
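
A minimal sketch of the first half of that validation step, assuming the citation format from the prompt above. The regex, the function names, and the in-memory `chunk_index` are hypothetical; in production the lookup would go to the metadata store. The semantic-consistency check is not shown, since it depends on the embedding or entailment model you choose.

```python
import re

# Matches the citation format requested in the system prompt.
# Assumes document names contain no commas or closing brackets.
CITATION_RE = re.compile(
    r"\[Contract: (?P<document>[^,\]]+), Section (?P<section>[^,\]]+), Clause (?P<clause>[^\]]+)\]"
)

Citation = tuple[str, str, str]


def extract_citations(answer: str) -> list[Citation]:
    """Pull (document, section, clause) triples out of an LLM answer."""
    return [
        (m["document"].strip(), m["section"].strip(), m["clause"].strip())
        for m in CITATION_RE.finditer(answer)
    ]


def find_unverified(answer: str, chunk_index: dict[Citation, str]) -> list[Citation]:
    """Return citations that do not resolve to a stored chunk."""
    return [c for c in extract_citations(answer) if c not in chunk_index]


# Hypothetical index keyed by (document name, section, clause number).
chunk_index = {
    ("msa_acme.pdf", "7", "7.3"): "If the Supplier fails to deliver on time, it shall pay liquidated damages.",
}

answer = (
    "Late delivery triggers weekly damages [Contract: msa_acme.pdf, Section 7, Clause 7.3]. "
    "The cap is 10% of fees [Contract: msa_acme.pdf, Section 7, Clause 7.9]."
)
needs_review = find_unverified(answer, chunk_index)
```

Here the second citation points at a clause that is not in the index, so it ends up in `needs_review`. An answer with no citations at all yields an empty list from `extract_citations`, which is worth treating as a review trigger in its own right.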

The Stack We Used in Production

  • Qdrant: Vector database — supports both dense and sparse vectors natively, good performance at 40K documents
  • BGE-M3: Embedding model — multilingual, handles both dense and sparse encoding
  • Cohere Rerank: Cross-encoder re-ranking via API
  • LangChain: Orchestration for the retrieval chain and prompt management
  • FastAPI: Serving layer with async request handling
  • AWS Aurora PostgreSQL: Metadata store and citation audit log

The entire pipeline, from query to cited answer, has a p95 latency under 3 seconds on a corpus of 40,000 contracts. The main bottleneck is the re-ranker API call, which can be parallelized across the top candidates if needed.
