Why Bigger Context Windows Do Not Solve the Problem
The intuition that 'larger context window = better document understanding' is wrong in two ways. First, many documents still exceed even the largest context windows — a 500-page legal agreement, a multi-year set of financial records, a codebase. Second, models exhibit the 'lost in the middle' phenomenon: they attend well to content near the beginning and end of a long prompt, but poorly to content in the middle.
A well-designed retrieval or summarization strategy often outperforms naive full-document insertion for question-answering tasks, even when the document fits in the context window. The model tends to give better answers from a focused 2,000-token chunk than from a 100,000-token dump where the relevant passage is buried around token 60,000.
Technique 1 — Retrieval (RAG) for Question Answering
If the task is answering questions about a document, retrieval is almost always the right approach. Instead of inserting the full document, retrieve only the passages relevant to the specific question and insert those.
The key insight: a question-answering task only ever needs a small fraction of a document's content for any given question. RAG finds that fraction. Full-document insertion sends all the irrelevant content along for the ride and dilutes the signal.
- Use a structure-aware chunker for your document type — section boundaries for legal documents, function/class boundaries for code, paragraph boundaries for prose.
- Store chunk-level metadata: section number, page number, document title. Include this in retrieved chunks so the model can cite precisely.
- For multi-hop questions that span multiple sections, use a query expansion step: generate 2–3 sub-queries from the original question and retrieve for each.
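The three points above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the `Chunk` class, the heading regex, and the word-overlap scoring are all assumptions made for the example. A real system would score with embeddings, and the sub-queries would come from an LLM rather than being written by hand.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str    # document title, kept for citations
    section: str  # heading of the section the chunk came from


def chunk_by_section(document: str, title: str) -> list[Chunk]:
    """Split on lines that look like numbered headings such as '2. Termination'.

    Naive by design: any line starting with a number is treated as a heading.
    """
    chunks, section, lines = [], "Preamble", []
    for line in document.splitlines():
        if re.match(r"^\d+(\.\d+)*\.?\s+\S", line):
            if lines:
                chunks.append(Chunk("\n".join(lines).strip(), title, section))
            section, lines = line.strip(), []
        else:
            lines.append(line)
    if lines:
        chunks.append(Chunk("\n".join(lines).strip(), title, section))
    return [c for c in chunks if c.text]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(queries: list[str], chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Rank chunks by their best word overlap with any sub-query; keep the top k."""
    def score(chunk: Chunk) -> int:
        words = tokens(chunk.text) | tokens(chunk.section)
        return max(len(tokens(q) & words) for q in queries)

    ranked = sorted(chunks, key=score, reverse=True)
    return [c for c in ranked[:k] if score(c) > 0]


def build_prompt(question: str, retrieved: list[Chunk]) -> str:
    """Prefix each excerpt with its metadata so the model can cite it."""
    context = "\n\n".join(f"[{c.title}, {c.section}]\n{c.text}" for c in retrieved)
    return (
        "Answer using only the excerpts below and cite the section.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Passing the original question together with its sub-queries to `retrieve` means a chunk is kept if it matches any of them, which is the point of the expansion step.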
Technique 2 — Map-Reduce Summarization
When the task requires synthesizing information across an entire document — generating a full summary, identifying all mentions of a topic, extracting all entities — retrieval is insufficient because you need to process every part of the document. Map-reduce handles this:
1. Map: Split the document into chunks that fit within the context window. Send each chunk to the LLM independently with the same task (e.g., 'summarize this section', 'extract all mentions of X'). Run in parallel.
2. Reduce: Collect all intermediate outputs. If the combined intermediate outputs fit in the context window, send them all at once for a final synthesis. If not, apply reduce recursively.
import asyncio

# Assumes `llm` (a client with an async `acomplete` method that returns a string),
# `split_into_chunks`, and `token_count` are defined elsewhere in your codebase.
async def map_reduce_summarize(document: str, chunk_size: int = 4000) -> str:
    chunks = split_into_chunks(document, chunk_size)
    # Map: summarize each chunk in parallel
    map_tasks = [
        llm.acomplete(f"Summarize the key points in this section:\n\n{chunk}")
        for chunk in chunks
    ]
    chunk_summaries = await asyncio.gather(*map_tasks)
    combined = "\n\n---\n\n".join(chunk_summaries)
    # Reduce: if combined fits in the (assumed) 100,000-token budget, synthesize directly
    if token_count(combined) < 100_000:
        return await llm.acomplete(
            f"Synthesize these section summaries into a coherent overall summary:\n\n{combined}"
        )
    # Otherwise reduce recursively
    return await map_reduce_summarize(combined, chunk_size)

Technique 3 — Sliding Window with Rolling Context
For tasks that require sequential processing — reviewing a contract clause by clause, analyzing a transcript turn by turn — a sliding window maintains a running context summary that is carried forward through the document.
- Process chunk N with the full text of chunk N in context, plus a compressed summary of all previous chunks.
- After processing, update the running summary to include the key points from chunk N.
- The running summary grows slowly (each step adds a few sentences) while the window moves forward through the document.
- This preserves narrative continuity and cross-reference context that map-reduce misses.
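The loop described above might look like the following sketch. The `llm` argument is a hypothetical callable that takes a prompt string and returns a string; the prompts and the `sliding_window_review` name are illustrative assumptions, not a fixed recipe.

```python
from typing import Callable


def sliding_window_review(
    chunks: list[str],
    llm: Callable[[str], str],
    task: str = "List any risks in this section.",
) -> list[str]:
    """Process chunks in order, carrying a compressed summary of earlier chunks."""
    running_summary = ""
    findings = []
    for i, chunk in enumerate(chunks, start=1):
        context = running_summary or "(nothing yet)"
        # Analyze chunk i with its full text plus the summary of everything before it.
        findings.append(llm(
            f"Summary of the document so far:\n{context}\n\n"
            f"Current section ({i} of {len(chunks)}):\n{chunk}\n\n{task}"
        ))
        # Fold the key points of this chunk into the running summary.
        running_summary = llm(
            f"Existing summary:\n{context}\n\n"
            f"New section:\n{chunk}\n\n"
            "Update the summary in a few sentences, keeping defined terms "
            "and cross-references."
        )
    return findings
```

This makes two calls per chunk and runs strictly in sequence, so unlike map-reduce it cannot be parallelized. That is the price of the continuity it preserves.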
Technique 4 — Hierarchical Document Indexing
For large document collections (not just large individual documents), a two-level index improves retrieval quality significantly. The first level stores document-level summaries; the second level stores chunk-level content.
- Generate a summary embedding for each document and store it in the vector database alongside the chunk embeddings.
- For each incoming query, first retrieve the top-N most relevant documents using document-level similarity.
- Then retrieve the top-K most relevant chunks from within those N documents.
- This avoids returning highly similar chunks from many different documents when the user's query is specific to a narrow subset.
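A minimal in-memory sketch of the two-level lookup follows. It assumes embeddings are already computed and stored as plain lists of floats; the `doc_index` and `chunk_index` structures and the function names are hypothetical stand-ins for whatever your vector database provides.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hierarchical_search(query_vec, doc_index, chunk_index, top_n=2, top_k=3):
    """Two-level retrieval.

    doc_index:   {doc_id: summary_embedding}
    chunk_index: [(doc_id, chunk_text, chunk_embedding), ...]
    """
    # Level 1: pick the top-N documents by summary similarity.
    top_docs = sorted(
        doc_index, key=lambda d: cosine(query_vec, doc_index[d]), reverse=True
    )[:top_n]
    # Level 2: rank only the chunks that belong to those documents.
    candidates = [c for c in chunk_index if c[0] in top_docs]
    candidates.sort(key=lambda c: cosine(query_vec, c[2]), reverse=True)
    return [(doc_id, text) for doc_id, text, _ in candidates[:top_k]]
```

Because level 2 only sees chunks from the selected documents, a lookalike chunk in an unrelated document is never returned, even if its embedding is closer to the query than anything in the selected set.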
Choosing the Right Technique
- Question answering over a specific document → RAG with structure-aware chunking
- Full document summarization → Map-reduce
- Multi-document synthesis or comparison → Map-reduce with a cross-document reduce step
- Sequential analysis (contract review, code audit) → Sliding window
- Large document collection with diverse queries → Hierarchical indexing
- Document fits in context AND the whole document is relevant → Full insertion (edge case — rarer than you think)
A Note on Cost
Prompt length directly drives cost. At an illustrative rate of $15 per million input tokens, sending a 100,000-token document costs $1.50 per request in input tokens alone (actual rates vary by model and change often, so check your provider's current pricing). If you process that document to answer 20 different questions, that is $30 in input tokens for one document. With RAG retrieving 2,000 relevant tokens per query, the same 20 questions cost $0.60 total: a 50x reduction in input token cost whatever the rate, and often with better answer quality.
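The arithmetic is easy to rerun with your own numbers. In the sketch below, the rate of $15 per million input tokens is an assumption chosen to match the figures above, not a quoted price for any particular model. The calculation covers input tokens only and ignores embedding and retrieval overhead.

```python
def input_cost(tokens_per_request: int, requests: int, usd_per_million: float) -> float:
    """Input-token cost in USD for a batch of requests."""
    return tokens_per_request * requests * usd_per_million / 1_000_000


RATE = 15.0  # assumed USD per million input tokens; substitute your provider's rate

full = input_cost(100_000, 20, RATE)  # whole document sent with every question
rag = input_cost(2_000, 20, RATE)     # only the retrieved passages
print(f"full insertion: ${full:.2f}, RAG: ${rag:.2f}, ratio: {full / rag:.0f}x")
```

The ratio depends only on the token counts (100,000 / 2,000), so the 50x figure holds at any per-token price.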
Context window management is not just a capability decision — it is a cost engineering decision. The right technique for the task often turns out to also be the cheapest one at production volume.