The Three Failure Modes of Naive RAG
Naive RAG — chunk a document into fixed-size pieces, embed them, store in a vector database, retrieve by cosine similarity — works in demos because demos are designed to succeed. In production, three failure modes appear immediately:
1. Retrieval misses the answer even though the answer is in the corpus. The relevant passage did not rank in the top-k because the query semantics diverged from the chunk embedding.
2. Retrieval returns the right document but the wrong chunk. The answer is split across a chunk boundary, so neither chunk is independently useful.
3. The LLM hallucinates a citation. It produces a confident answer that paraphrases content from the wrong document or invents a clause that does not exist.
Each failure mode requires a different fix. Trying to solve all three by just improving your embeddings — the most common advice — will not work. Embeddings are only one part of the retrieval chain.
Chunking Strategy: The Most Underrated Lever
The default chunking advice (512 tokens with a 50-token overlap) is a reasonable starting point for generic text. For structured documents like contracts, medical records, or financial reports, it is almost always wrong.
Legal contracts have a natural structure: parties, recitals, definitions, clauses, schedules. A fixed-size chunker splits clauses mid-sentence and puts unrelated definitions in the same chunk. For the legal RAG system I built, we replaced fixed-size chunking with a structure-aware chunker that:
- Identified section boundaries using heading patterns and indentation
- Split at clause level, not token count — a clause stays in one chunk even if it is 800 tokens
- Added a metadata header to each chunk: document name, section number, clause type, date range
- Created overlapping 'context chunks' — short summaries of adjacent sections prepended to each chunk
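The clause-level splitting and metadata header described above can be sketched roughly as follows. This is a minimal illustration, not the chunker from the production system: the heading regex, the `Chunk` dataclass, and the header layout are all assumptions, and it leaves out indentation analysis, clause type and date metadata, and the summary-based context chunks.

```python
import re
from dataclasses import dataclass

# Hypothetical heading pattern: numbered clauses such as "7.3 Liquidated Damages".
# Real contracts need patterns tuned to their own numbering and layout; this one
# would also misfire on body lines that happen to start with a number.
CLAUSE_HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


@dataclass
class Chunk:
    document: str
    section: str
    title: str
    text: str

    def with_header(self) -> str:
        # Prepend a metadata header so the embedded text carries its provenance.
        return f"[{self.document} | Section {self.section} | {self.title}]\n{self.text}"


def chunk_by_clause(document: str, raw_text: str) -> list[Chunk]:
    """Split at clause headings; a clause is never cut by token count."""
    chunks: list[Chunk] = []
    section, title, body = "preamble", "Preamble", []

    def flush() -> None:
        # Emit the clause collected so far, skipping empty ones.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(document, section, title, text))

    for line in raw_text.splitlines():
        match = CLAUSE_HEADING.match(line.strip())
        if match:
            flush()
            section, title = match.group(1), match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

Embedding `chunk.with_header()` instead of the bare clause text is one way to make the section number visible to both dense and sparse retrieval.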
Hybrid Search: Dense + Sparse
Dense retrieval (embedding similarity via models like text-embedding-3-large or BGE) captures semantic meaning well but struggles with exact keyword matches. If a lawyer asks about 'Section 7.3 liquidated damages clause', a dense search might return semantically similar content from other sections instead of the exact clause.
Sparse retrieval (BM25 or BM42) is excellent for exact keyword and phrase matching but misses paraphrasing. Hybrid search combines both:
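
    from qdrant_client import QdrantClient
    from qdrant_client.models import NamedSparseVector, SparseVector

    # Endpoint is a placeholder; embed(), bm25_encode() and
    # reciprocal_rank_fusion() are helpers defined elsewhere.
    client = QdrantClient(url="http://localhost:6333")

    def hybrid_search(query: str, top_k: int = 20) -> list[dict]:
        # Dense pass: semantic similarity on the query embedding.
        dense_results = client.search(
            collection_name="contracts",
            query_vector=embed(query),
            limit=top_k,
            with_payload=True,
        )
        # Sparse pass: Qdrant sparse vectors are always named, so the query
        # is wrapped in NamedSparseVector ("sparse" is an assumed name).
        sparse_results = client.search(
            collection_name="contracts",
            query_vector=NamedSparseVector(
                name="sparse",
                vector=SparseVector(**bm25_encode(query)),
            ),
            limit=top_k,
            with_payload=True,
        )
        return reciprocal_rank_fusion(dense_results, sparse_results)

Reciprocal Rank Fusion (RRF) is a simple and effective way to merge results from two ranked lists without needing calibrated scores. The formula is straightforward: for each document, sum 1/(rank + k) across all lists it appears in, where k is a smoothing constant (typically 60).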
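The fusion formula above is short enough to write out. The sketch below is a simplified stand-in for the `reciprocal_rank_fusion` helper, under two assumptions: it works on plain chunk IDs (for Qdrant results, something like `[hit.id for hit in dense_results]`), and it returns `(id, score)` pairs, so payloads would have to be looked up afterwards.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def reciprocal_rank_fusion(
    *ranked_lists: Sequence[Hashable], k: int = 60
) -> list[tuple[Hashable, float]]:
    """Fuse ranked lists of IDs; each list is ordered best-first."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, so the top hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ids = ["a", "b", "c"]
sparse_ids = ["b", "d"]
fused = reciprocal_rank_fusion(dense_ids, sparse_ids)
```

Here "b" ends up first because it appears in both lists, even though it tops only one of them. That behavior is the reason RRF needs no score calibration: only ranks are used.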
Cross-Encoder Re-Ranking
After hybrid retrieval, you typically have 20–40 candidates. The retrieval step optimizes for recall — getting the right chunk somewhere in the results. Re-ranking optimizes for precision — putting the best chunk at the top.
Bi-encoder models (used for the dense embeddings) encode the query and document independently, making them fast enough for large-scale retrieval. Cross-encoders process the query and document together, giving them much better judgment about relevance — at the cost of being too slow to run over the full corpus.
The two-stage approach: retrieve 20–40 candidates cheaply, re-rank with a cross-encoder, send the top 5 to the LLM. For the legal system, we used Cohere's Rerank API. You can also run an open-source cross-encoder like BGE-reranker-v2-m3 on your own infrastructure if you need data residency.
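The second stage can be sketched independently of any particular re-ranker. In the code below, `rerank`, `PairScorer`, and `keyword_overlap_scorer` are illustrative names, and the scorer is a toy stand-in so the example runs without a model or API key. It is not how Cohere Rerank or BGE-reranker score passages.

```python
from typing import Callable, Sequence

# A scorer takes (query, passage) pairs and returns one relevance score per pair.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(
    query: str, candidates: Sequence[str], score_pairs: PairScorer, top_n: int = 5
) -> list[tuple[str, float]]:
    """Score every candidate jointly with the query and keep the top_n."""
    if not candidates:
        return []
    pairs = [(query, passage) for passage in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked[:top_n]]


def keyword_overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy stand-in for a cross-encoder: share of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_terms = set(query.lower().split())
        passage_terms = set(passage.lower().split())
        scores.append(len(query_terms & passage_terms) / max(len(query_terms), 1))
    return scores


candidates = [
    "termination for convenience",
    "liquidated damages for late delivery",
    "governing law",
]
top = rerank("liquidated damages clause", candidates, keyword_overlap_scorer, top_n=2)
```

With a self-hosted model, the `predict` method of a sentence-transformers `CrossEncoder` loaded with a checkpoint such as `BAAI/bge-reranker-v2-m3` takes a list of pairs and should fit the `score_pairs` slot, though that wiring is an assumption not tested here.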
Citation: Every Answer Must Prove Itself
In legal and compliance contexts, a correct answer without a source is useless. The LLM must cite the specific clause it is drawing from, and the application must verify that citation is real.
The implementation has two parts. First, the prompt requires structured citation output:

    SYSTEM_PROMPT = """
    You are a legal document analyst. Answer based ONLY on the provided context.
    For each claim in your answer, you MUST include a citation in this format:
    [Contract: {document_name}, Section {section}, Clause {clause_number}]
    If the answer is not in the provided context, say: "I cannot find this in the provided documents."
    Do not speculate or use general legal knowledge.
    """

Second, after the LLM response is generated, a post-processing step validates each citation: it retrieves the referenced chunk by document name, section, and clause number, and checks that the LLM's paraphrase is semantically consistent with the source. If the citation cannot be verified, the answer is flagged for human review.
The Stack We Used in Production
- Qdrant: Vector database — supports both dense and sparse vectors natively, good performance at 40K documents
- BGE-M3: Embedding model — multilingual, handles both dense and sparse encoding
- Cohere Rerank: Cross-encoder re-ranking via API
- LangChain: Orchestration for the retrieval chain and prompt management
- FastAPI: Serving layer with async request handling
- AWS Aurora PostgreSQL: Metadata store and citation audit log
The entire pipeline, from query to cited answer, has a p95 latency under 3 seconds on a corpus of 40,000 contracts. The main bottleneck is the re-ranker API call, which can be parallelized across the candidates if needed.
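One possible way to parallelize the re-ranking call is to split the candidates into batches and score them concurrently. The sketch below assumes a hypothetical `score_batch(query, passages)` callable that wraps whatever re-ranker is in use and scores each passage independently. Whether this actually lowers latency depends on the provider's rate limits and per-request overhead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def score_in_parallel(
    query: str,
    candidates: Sequence[str],
    score_batch: Callable[[str, Sequence[str]], Sequence[float]],
    batch_size: int = 10,
    max_workers: int = 4,
) -> list[float]:
    """Score candidates in concurrent batches; output order matches input order."""
    batches = [
        candidates[start : start + batch_size]
        for start in range(0, len(candidates), batch_size)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in submission order, so scores stay aligned
        # with the candidates even if batches finish out of order.
        results = pool.map(lambda batch: score_batch(query, batch), batches)
        return [float(score) for batch_scores in results for score in batch_scores]
```

Threads are used because the work is I/O-bound network calls. In an async FastAPI handler, the same idea could be expressed with `asyncio.gather` instead.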