Ever since context windows reached 200,000 tokens and beyond, one question shows up in every AI project: do you still need RAG, or can you just put everything in the prompt? The short answer: it is not one versus the other. They are two tools that answer different constraints — cost, latency, freshness, corpus size — and the right instinct is to choose by constraint, not by hype.
This guide lays out a clear decision framework, backed by real numbers and by our own infrastructure experience: what long context genuinely does well, where it collapses, and how to combine the two.
Both approaches in one sentence
- RAG (Retrieval-Augmented Generation): index the corpus, and for each question retrieve only the relevant passages to inject into the prompt. The model only sees what it needs.
- Long context: place the entire set of documents into the model’s context window on every request. The model sees everything, and it is up to it to find the information.
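To make the contrast concrete, here is a minimal sketch of how each approach assembles its prompt. The word-overlap scoring is a deliberately crude stand-in for a real embedding search, and the function names and sample corpus are hypothetical.

```python
import re

def tokenize(text: str) -> set:
    """Lowercased word set; a crude stand-in for embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_rag_prompt(question: str, corpus: list, k: int = 2) -> str:
    """RAG: rank passages by word overlap with the question, keep the top k."""
    q_words = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    context = "\n\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"

def build_long_context_prompt(question: str, corpus: list) -> str:
    """Long context: send every passage on every request."""
    context = "\n\n".join(corpus)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hypothetical three-passage corpus.
corpus = [
    "A refund is accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
question = "What is the refund window after purchase?"

rag_prompt = build_rag_prompt(question, corpus, k=1)
full_prompt = build_long_context_prompt(question, corpus)
print(len(rag_prompt), len(full_prompt))
```

The only difference between the two functions is the selection step, and that step is what drives the cost, latency and traceability trade-offs discussed in the rest of this guide.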
The myth of the “unlimited” window
The main argument against RAG is: “windows are now 200k, 1M tokens, so you can put it all in there.” True on the spec sheet. Much less so in practice.
On our own long-context benchmarks, we measured a systematic gap between the advertised window and the actually usable one. The model’s ability to retrieve a specific fact buried in the middle of a long document drops well before the stated limit: the cliff often appears between 100k and 120k usable tokens when the spec sheet claims 262k. This is the phenomenon known as “lost in the middle”: information placed at the center of a very long context is recalled far less reliably than information at the start or the end.
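A common way to observe this effect is a needle-in-a-haystack probe: bury one fact at different depths of a long filler text and check whether the model can still return it. The sketch below only builds the probe prompts. It is a generic illustration, not the harness behind the numbers above, and the filler text, needle and function name are hypothetical.

```python
def build_needle_prompt(filler_sentences: list, needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)

# Hypothetical haystack and needle.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The access code for the archive room is 7421."

# One probe per depth; each would be sent to the model with the question
# "What is the access code for the archive room?"
probes = {
    depth: build_needle_prompt(filler, needle, depth)
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
```

Repeating the probe at several total lengths and recording recall per depth gives a picture of where the usable window ends for a given model.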
Our infrastructure takeaway was blunt: memory (what you choose to show the model) matters more than model architecture. A system that intelligently selects 4,000 relevant tokens almost always beats one that dumps 120,000 tokens into the prompt in the hope that the model will sort them out.
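Selecting a few thousand relevant tokens can be sketched as a greedy fill under a token budget. This is a minimal illustration, assuming passages already carry a relevance score. The whitespace word count is only a rough proxy for tokens, and the names are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a rough token proxy.
    # In practice, use the tokenizer of the target model.
    return len(text.split())

def select_within_budget(scored_passages: list, budget_tokens: int = 4000):
    """Keep the highest-scoring passages that fit in the budget.

    scored_passages is a list of (score, text) pairs.
    Returns the selected texts and the number of tokens used.
    """
    selected, used = [], 0
    ranked = sorted(scored_passages, key=lambda item: item[0], reverse=True)
    for _score, text in ranked:
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip passages that would overflow the budget
        selected.append(text)
        used += cost
    return selected, used

# Hypothetical scored passages of 3000, 2000, 900 and 500 "tokens".
passages = [
    (0.9, "alpha " * 3000),
    (0.8, "beta " * 2000),
    (0.5, "gamma " * 900),
    (0.1, "delta " * 500),
]
selected, used = select_within_budget(passages, budget_tokens=4000)
print(len(selected), used)
```

The budget, rather than the window size, becomes the knob you tune.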
The real deciding factor: cost per request
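This is where everything is decided in production. Compare a question asked over a 100,000-token knowledge base: RAG injects only the retrieved passages, on the order of 4,000 tokens, while everything-in-context sends all 100,000 tokens on every request.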
The cost of an LLM in production is dominated by input tokens. Multiplying input by 25 multiplies the inference bill by a similar factor. At high volume, that is the difference between a viable service and a money pit.
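The arithmetic is easy to check. The sketch below assumes a RAG prompt of about 4,000 tokens (the size consistent with the 25x factor), a hypothetical price of 3.00 USD per million input tokens, and a hypothetical volume of 10,000 requests per day. Substitute your provider's real price and your own traffic.

```python
# Hypothetical input price in USD per million tokens.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

def input_cost(tokens_per_request: int, requests: int,
               price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-token cost in USD for a given number of requests."""
    return tokens_per_request * requests * price_per_million / 1_000_000

REQUESTS_PER_DAY = 10_000  # hypothetical volume

rag_cost = input_cost(4_000, REQUESTS_PER_DAY)
long_context_cost = input_cost(100_000, REQUESTS_PER_DAY)
ratio = long_context_cost / rag_cost

print(f"RAG: ${rag_cost:,.2f}/day")
print(f"Long context: ${long_context_cost:,.2f}/day")
print(f"Ratio: {ratio:.0f}x")
```

The ratio does not depend on the price or the volume, only on the two prompt sizes. Output tokens and any prompt-caching discount are left out of this sketch.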
Latency, freshness, traceability
Latency (TTFT)
The longer the prompt, the higher the time-to-first-token: the model must read the whole context before answering. A 100k-token prompt adds seconds of prefill to every request. RAG, by injecting only a few thousand tokens, keeps latency low and stable.
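TTFT is straightforward to measure on any streaming API: start a timer when the request is issued and stop it when the first chunk arrives. The sketch below shows the measurement only. `fake_stream` is a hypothetical stand-in for a real streaming call, and its prefill speed is set unrealistically high so that the demo runs quickly.

```python
import time
from typing import Callable, Iterable

def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> float:
    """Seconds between issuing the request and receiving the first chunk."""
    started = time.perf_counter()
    for _chunk in start_stream():
        return time.perf_counter() - started
    raise RuntimeError("stream produced no chunks")

def fake_stream(prompt_tokens: int, prefill_tokens_per_second: float = 1_000_000.0):
    # Hypothetical model: prefill time grows linearly with prompt length.
    time.sleep(prompt_tokens / prefill_tokens_per_second)
    yield "first chunk"
    yield "second chunk"

ttft_rag = measure_ttft(lambda: fake_stream(4_000))
ttft_long = measure_ttft(lambda: fake_stream(100_000))
print(f"RAG-sized prompt: {ttft_rag:.3f}s, 100k-token prompt: {ttft_long:.3f}s")
```

Against a real endpoint, replace `fake_stream` with your client's streaming call and take the median over several runs, since a single measurement is noisy.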
Data freshness
With RAG, updating knowledge means re-indexing one document. With everything-in-context, you reload the entire corpus on every change. If your data changes more than once a month, RAG is almost always the right choice.
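Per-document re-indexing can be sketched with an index keyed by document ID, so that an update replaces only that document's chunks. This is a minimal in-memory illustration. The class name, the word-based chunking and the document IDs are hypothetical, and a real deployment would use a vector store with the same upsert semantics.

```python
class PassageIndex:
    """Toy index that stores chunks per document ID."""

    def __init__(self):
        self._chunks_by_doc = {}

    def upsert(self, doc_id: str, text: str, chunk_size: int = 50) -> int:
        """Re-chunk one document and replace its previous chunks only."""
        words = text.split()
        chunks = [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]
        self._chunks_by_doc[doc_id] = chunks
        return len(chunks)

    def delete(self, doc_id: str) -> None:
        self._chunks_by_doc.pop(doc_id, None)

    def chunks(self, doc_id: str) -> list:
        return list(self._chunks_by_doc.get(doc_id, []))

index = PassageIndex()
index.upsert("pricing", "The basic plan costs 10 euros per month.")
index.upsert("faq", "Support is available on weekdays.")

faq_before = index.chunks("faq")
# The pricing page changes: only that document is re-indexed.
index.upsert("pricing", "The basic plan costs 12 euros per month.")
```

The cost of an update is proportional to the size of the changed document, not to the size of the corpus.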
Source traceability
RAG knows where an answer comes from: the retrieved passage is the citation. Everything-in-context makes traceability fuzzier — the model “read everything,” so pinpointing the exact source is hard. For regulated use or any verifiability requirement, that is a decisive argument.
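One way to keep that link explicit is to number the retrieved passages in the prompt and map the markers in the answer back to their sources. The sketch below assumes the model is instructed to cite passages as `[n]`. The source paths, function names and sample answer are hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    source: str  # hypothetical locator, e.g. file path plus section
    text: str

def build_cited_prompt(question: str, passages: list):
    """Number each passage and return the prompt plus a marker-to-source map."""
    numbered = [f"[{i}] {p.text}" for i, p in enumerate(passages, start=1)]
    citations = {i: p.source for i, p in enumerate(passages, start=1)}
    prompt = (
        "Answer using only the passages below and cite them as [n].\n\n"
        + "\n".join(numbered)
        + f"\n\nQuestion: {question}"
    )
    return prompt, citations

def cited_sources(answer: str, citations: dict) -> list:
    """Resolve [n] markers found in the answer to their sources."""
    markers = [int(m) for m in re.findall(r"\[(\d+)\]", answer)]
    return [citations[m] for m in markers if m in citations]

passages = [
    Passage("policies/refunds.md#window", "A refund is accepted within 30 days."),
    Passage("policies/shipping.md#delays", "Shipping takes three to five days."),
]
prompt, citations = build_cited_prompt("What is the refund window?", passages)

# Hypothetical model answer, used here to show the resolution step.
answer = "A refund is accepted within 30 days [1]."
sources = cited_sources(answer, citations)
```

Markers that do not match a retrieved passage are dropped, which gives a cheap check against invented citations.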
Decision table
| Criterion | RAG | Long context |
|---|---|---|
| Large / evolving corpus | ✅ Ideal | ❌ Costly, capped |
| Single document fitting the window | ➖ Overkill | ✅ Ideal |
| Cost at high volume | ✅ Low and stable | ❌ High |
| Global reasoning over the whole text | ➖ Fragmented | ✅ Natural |
| Verifiable sources / citations | ✅ Native | ➖ Fuzzy |
| Latency (TTFT) | ✅ Low | ❌ Grows with size |
The architecture that often wins: wide-window RAG
The best answer is usually not binary. Use RAG to bound what the model sees (cost, latency, freshness, citations), then a large context window to reason over several retrieved passages at once. RAG picks 8 to 12 relevant passages; long context lets you synthesize them without fragmenting them. You keep the best of both: controlled cost and multi-document reasoning.
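The hybrid can be sketched in two steps: keep the top-scoring passages, then restore their original reading order so the model synthesizes coherent text rather than a shuffled list. Restoring document order is a design choice of this sketch, not something the approach requires. The tuple layout, the default of 10 passages and the sample data are hypothetical.

```python
def select_for_synthesis(scored_passages: list, k: int = 10) -> list:
    """Keep the k best passages, then sort them back into reading order.

    Each passage is a tuple (score, doc_id, position, text).
    """
    top = sorted(scored_passages, key=lambda p: p[0], reverse=True)[:k]
    return sorted(top, key=lambda p: (p[1], p[2]))

def build_synthesis_prompt(question: str, passages: list) -> str:
    """One prompt containing all selected passages, labeled by document."""
    blocks = [f"({doc_id}, part {position})\n{text}"
              for _score, doc_id, position, text in passages]
    return (
        "Synthesize an answer across all passages below.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}"
    )

# Hypothetical retrieval results: 20 scored passages from two documents.
scored = [
    ((i * 7 % 20) / 20, "doc_a" if i < 10 else "doc_b", i % 10, f"Passage {i}.")
    for i in range(20)
]
selected = select_for_synthesis(scored, k=10)
prompt = build_synthesis_prompt("How do the two documents differ?", selected)
```

Retrieval bounds the input size, and the wide window is spent on reading the selected passages together instead of on the whole corpus.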
Our take
For 80% of production cases (knowledge base, support, document search, evolving corpus), RAG remains the default choice, for reasons of cost, latency and traceability. Long context shines on the “one document, global reasoning, low volume” case. And the hybrid often beats both. The real question is never “RAG or long context?” but “what are my cost, freshness and volume constraints?”