1. What is RAG and When to Use It
Retrieval-Augmented Generation (RAG) connects an LLM to an external knowledge base at query time. Instead of relying solely on what the model learned during training, RAG retrieves relevant document chunks and injects them into the prompt as context.
Use RAG when:
- Your documents change frequently (internal wikis, product catalogs, legal contracts)
- Data is private and was never in any training set (customer records, internal reports)
- You need source attribution — users must be able to verify answers
- Fine-tuning is too expensive or too slow to iterate on
Do not use RAG when: the model already knows the domain well (general coding questions, common knowledge), or when you need the LLM to learn a new behavior rather than new facts — that's what fine-tuning is for.
Cost comparison: Fine-tuning GPT-4o costs ~$25/1M training tokens + $5-15/hour compute. Updating a RAG knowledge base costs ~$0.02/1M tokens for re-embedding changed chunks. For documents that change weekly, RAG is 100-500x cheaper to keep current.
2. Environment Setup
We'll use LangChain 0.3+, ChromaDB (open-source, runs locally), and support both OpenAI and Ollama (free, local inference). Python 3.11+ required.
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Core LangChain packages
pip install langchain==0.3.0 langchain-community==0.3.0 langchain-openai==0.2.0
# Document loaders
pip install pypdf unstructured[pdf] python-docx
# Vector store
pip install chromadb==0.5.0
# Evaluation framework
pip install ragas==0.2.0 datasets
# Optional: local inference (no API costs)
# Install Ollama from https://ollama.ai/, then:
# ollama pull llama3.3:70b
# ollama pull nomic-embed-text
pip install langchain-ollama # LangChain integration for Ollama
Configure API keys — or skip if using Ollama:
# .env file (never commit this)
OPENAI_API_KEY=sk-...
# Load in Python
from dotenv import load_dotenv
load_dotenv()
# Verify
import openai
print(openai.models.list().data[0].id) # Should print a model name
3. Document Loading and Processing
LangChain provides loaders for 50+ formats. All return a list of Document objects with page_content and metadata, so the rest of your pipeline is format-agnostic.
from langchain_community.document_loaders import (
PyPDFLoader,
UnstructuredWordDocumentLoader,
WebBaseLoader,
DirectoryLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path
# ── Single file loaders ──────────────────────────────────────────────────────
pdf_docs = PyPDFLoader("annual_report.pdf").load()
word_docs = UnstructuredWordDocumentLoader("contract.docx").load()
web_docs = WebBaseLoader("https://docs.example.com/api").load()
# ── Load an entire directory of PDFs ────────────────────────────────────────
dir_loader = DirectoryLoader(
"./knowledge_base/",
glob="**/*.pdf",
loader_cls=PyPDFLoader,
show_progress=True,
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} pages from {len(set(d.metadata['source'] for d in all_docs))} files")
# ── Add custom metadata before splitting ────────────────────────────────────
for doc in all_docs:
doc.metadata["department"] = "engineering"
doc.metadata["indexed_at"] = "2026-04-07"
# ── Split into chunks ────────────────────────────────────────────────────────
# RecursiveCharacterTextSplitter tries paragraph → sentence → word boundaries
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Characters per chunk (not tokens)
chunk_overlap=200, # Overlap preserves context across chunk boundaries
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(all_docs)
print(f"Split into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
# Expected output:
# Loaded 87 pages from 5 files
# Split into 412 chunks
# Average chunk size: 847 chars
Chunk size rules of thumb: Technical docs → 800-1200 chars. Legal text → 1500-2000 chars (clauses need full context). FAQ entries → 300-500 chars (one Q&A pair per chunk). Code files → split by function or class, not character count.
4. Indexing: Embeddings and Vector Store
Embeddings convert text into numerical vectors. Semantically similar text produces similar vectors, enabling fast nearest-neighbor search. We'll use ChromaDB — open-source, zero infrastructure, runs in-process or as a server.
from langchain_openai import OpenAIEmbeddings
from langchain_ollama import OllamaEmbeddings # local alternative
from langchain_community.vectorstores import Chroma
import os
# ── Option A: OpenAI embeddings (cost: $0.02 per 1M tokens) ─────────────────
embeddings_openai = OpenAIEmbeddings(model="text-embedding-3-small")
# ── Option B: Local Ollama (cost: $0, requires GPU or fast CPU) ──────────────
embeddings_local = OllamaEmbeddings(model="nomic-embed-text") # 768 dimensions
# Choose one:
embeddings = embeddings_openai # or embeddings_local
PERSIST_DIR = "./chroma_db"
if os.path.exists(PERSIST_DIR):
# Load existing index — no re-embedding needed
vectorstore = Chroma(
persist_directory=PERSIST_DIR,
embedding_function=embeddings,
)
print(f"Loaded existing vector store: {vectorstore._collection.count()} chunks")
else:
# First-time indexing
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory=PERSIST_DIR,
collection_metadata={"hnsw:space": "cosine"}, # cosine similarity
)
print(f"Indexed {vectorstore._collection.count()} chunks")
# ── Incremental update (add new docs without reindexing everything) ──────────
new_chunks = splitter.split_documents([new_doc])
vectorstore.add_documents(new_chunks)
print("Incremental update complete")
5. Retrieval Strategies: Similarity, MMR, and Hybrid
Retrieval is the most impactful variable in RAG quality. The embedding model and chunk size set the ceiling; retrieval strategy determines how close you get to it.
5a. Similarity Search (Baseline)
Returns the k chunks with the highest cosine similarity to the query vector. Fast, simple, sufficient for small homogeneous knowledge bases.
# Basic similarity retriever
retriever_sim = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 4},
)
results = retriever_sim.invoke("What is our refund policy?")
for doc in results:
print(f"[{doc.metadata.get('source', '?')}] {doc.page_content[:120]}...")
# Weakness: may return 4 near-identical chunks from the same section
5b. MMR — Maximal Marginal Relevance
MMR balances relevance and diversity. It retrieves candidates by similarity but then iteratively selects the one most relevant to the query and least similar to already-selected chunks. Use this when your corpus has many similar documents.
# MMR retriever
retriever_mmr = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={
"k": 4, # Final number of chunks returned
"fetch_k": 20, # Initial candidate pool to select from
"lambda_mult": 0.6, # 0 = max diversity, 1 = max relevance (0.5-0.7 is sweet spot)
},
)
# MMR example: refund policy query
# Similarity returns: [refund clause, refund clause (paraphrase), refund clause (alt), refund FAQ]
# MMR returns: [refund clause, returns policy, exchange policy, shipping policy]
# → Much more informative context for the LLM
5c. Hybrid Search (BM25 + Semantic)
Hybrid combines keyword search (BM25, exact term matching) with semantic search (embedding similarity). It closes the "vocabulary mismatch" problem: semantic search misses exact product codes or names; BM25 misses paraphrases. Together they win on both.
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# BM25 keyword retriever (no embeddings, pure term frequency)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4
# Semantic retriever
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Hybrid: weighted combination (0.5 = equal weight)
hybrid_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, semantic_retriever],
weights=[0.4, 0.6], # Give semantic search slightly more weight
)
# Hybrid excels on queries with specific identifiers
query = "What does section 4.2.1 say about data retention?"
# BM25 finds "4.2.1" exactly; semantic finds "data retention policy" context
results = hybrid_retriever.invoke(query)
print(f"Hybrid retrieved {len(results)} unique chunks")
5d. Reranking with a Cross-Encoder
Reranking runs a second, more accurate model to re-score the initial retrieval results. Adds ~200ms latency but typically improves Recall@4 by 8-12 percentage points. Worth it for production systems.
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
# Load reranker model (runs locally, no API key needed)
# First run downloads ~67MB model
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)
# Wrap any retriever with reranking
reranking_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=hybrid_retriever, # Fetch more candidates, rerank to top 3
)
results = reranking_retriever.invoke("data retention for EU customers")
# Results are now sorted by cross-encoder score, not just embedding similarity
6. Generation Chain with Source Attribution
With a retriever ready, we build the generation chain using LangChain Expression Language (LCEL). The chain is retriever → prompt → LLM → parser, all composable with the | operator.
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama # local alternative
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
# ── LLM setup ────────────────────────────────────────────────────────────────
llm_openai = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_local = ChatOllama(model="llama3.3:70b", temperature=0)
llm = llm_openai # switch to llm_local for zero API cost
# ── Prompt template ───────────────────────────────────────────────────────────
SYSTEM = """You are a precise assistant that answers questions based strictly
on the provided context. Rules:
- Only use information from the context below.
- If the context does not contain the answer, say "This information is not in
the provided documents" — do not hallucinate.
- Cite the source document when referencing specific facts.
Context:
{context}"""
prompt = ChatPromptTemplate.from_messages([
("system", SYSTEM),
("human", "{question}"),
])
def format_docs(docs: list) -> str:
"""Format retrieved documents with source labels."""
parts = []
for i, doc in enumerate(docs, 1):
source = doc.metadata.get("source", "unknown")
parts.append(f"[Source {i}: {source}]\n{doc.page_content}")
return "\n\n---\n\n".join(parts)
# ── RAG chain with source attribution ────────────────────────────────────────
rag_chain_with_sources = RunnableParallel(
{
"answer": (
{"context": reranking_retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
),
"sources": reranking_retriever,
}
)
# ── Query ─────────────────────────────────────────────────────────────────────
result = rag_chain_with_sources.invoke("What is the data retention period for EU users?")
print("Answer:", result["answer"])
print("\nSources:")
for doc in result["sources"]:
print(f" - {doc.metadata.get('source')} (page {doc.metadata.get('page', '?')})")
# Expected output:
# Answer: According to [Source 1: privacy_policy.pdf], EU users' data is retained
# for 30 days after account deletion, in compliance with Article 17 GDPR.
#
# Sources:
# - privacy_policy.pdf (page 4)
# - data_processing_agreement.pdf (page 12)
7. Evaluation with RAGAS
This section is the core differentiator of this guide. Without evaluation, you can't know if switching from similarity to MMR actually helped, or if a prompt change improved faithfulness. RAGAS gives you four key metrics:
| Metric | What it measures | Target |
|---|
| Faithfulness | Answer only claims facts present in retrieved context | > 0.85 |
| Answer Relevance | Answer actually addresses the question asked | > 0.80 |
| Context Precision | Retrieved chunks contain relevant information | > 0.75 |
| Context Recall | All necessary information was retrieved | > 0.80 |
7a. Build an Evaluation Dataset
RAGAS requires a test set of questions, expected answers, and retrieved contexts. Start with 20-30 hand-crafted Q&A pairs covering important topics in your knowledge base.
from datasets import Dataset
# Hand-crafted evaluation set (minimum viable: 20 examples)
eval_data = [
{
"question": "What is the refund period for digital products?",
"ground_truth": "Digital products are eligible for a 14-day refund if not downloaded.",
},
{
"question": "Which data centers store EU customer data?",
"ground_truth": "EU customer data is stored exclusively in Frankfurt (eu-central-1) and Dublin (eu-west-1) AWS regions.",
},
{
"question": "What is the maximum file size for uploads?",
"ground_truth": "The maximum upload file size is 500MB per file, with a 5GB total per account per day.",
},
# Add 17+ more examples covering your knowledge base...
]
# Retrieve contexts for each question
def build_eval_dataset(eval_data: list, retriever) -> Dataset:
rows = []
for item in eval_data:
docs = retriever.invoke(item["question"])
rows.append({
"question": item["question"],
"ground_truth": item["ground_truth"],
"contexts": [doc.page_content for doc in docs],
"answer": rag_chain_with_sources.invoke(item["question"])["answer"],
})
return Dataset.from_list(rows)
eval_dataset = build_eval_dataset(eval_data, reranking_retriever)
print(f"Evaluation dataset: {len(eval_dataset)} examples")
7b. Run RAGAS Evaluation
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
# Evaluate all four metrics
results = evaluate(
dataset=eval_dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
llm=llm, # Uses your LLM to score answers
embeddings=embeddings_openai, # Uses embeddings for relevance scoring
)
print(results)
# Expected output (good RAG system):
# {'faithfulness': 0.89, 'answer_relevancy': 0.84,
# 'context_precision': 0.78, 'context_recall': 0.82}
# Export detailed per-question results
df = results.to_pandas()
df.to_csv("rag_evaluation_results.csv", index=False)
# Find worst-performing questions
worst = df.nsmallest(5, "faithfulness")[["question", "faithfulness", "contexts"]]
print("\nTop 5 low-faithfulness questions:")
print(worst.to_string())
7c. Interpreting Results and Iterating
| Symptom | Root Cause | Fix |
|---|
| Low Faithfulness (<0.75) | LLM adds information not in context | Strengthen system prompt: "Only use provided context". Use temperature=0. |
| Low Context Precision (<0.60) | Retrieval brings in off-topic chunks | Reduce chunk size, add metadata filters, switch to MMR or hybrid |
| Low Context Recall (<0.70) | Relevant info not retrieved at all | Increase k, check chunk overlap, try reranking |
| Low Answer Relevance (<0.70) | Answer drifts from question; poor prompt | Add explicit instruction to stay on topic, improve prompt template |
# A/B test retrieval strategies using RAGAS
def benchmark_retriever(retriever, label: str):
dataset = build_eval_dataset(eval_data, retriever)
scores = evaluate(dataset, metrics=[faithfulness, context_precision, context_recall], llm=llm, embeddings=embeddings_openai)
print(f"\n{label}:")
print(f" Faithfulness: {scores['faithfulness']:.3f}")
print(f" Context Precision: {scores['context_precision']:.3f}")
print(f" Context Recall: {scores['context_recall']:.3f}")
return scores
benchmark_retriever(retriever_sim, "Similarity (baseline)")
benchmark_retriever(retriever_mmr, "MMR")
benchmark_retriever(hybrid_retriever, "Hybrid BM25+Semantic")
benchmark_retriever(reranking_retriever, "Hybrid + Reranking")
# Sample real-world results on a 5000-chunk knowledge base:
# Similarity (baseline): Faithfulness=0.79 Precision=0.64 Recall=0.71
# MMR: Faithfulness=0.81 Precision=0.70 Recall=0.73
# Hybrid BM25+Semantic: Faithfulness=0.84 Precision=0.76 Recall=0.79
# Hybrid + Reranking: Faithfulness=0.89 Precision=0.81 Recall=0.84
# → Reranking adds +10pp faithfulness vs. baseline at ~200ms latency cost
8. Production Best Practices
Cache Embeddings to Cut Costs
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
store = LocalFileStore("./embedding_cache/")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
underlying_embeddings=embeddings_openai,
document_embedding_cache=store,
namespace="text-embedding-3-small",
)
# Repeated queries hit disk cache, not OpenAI — saves ~60% of embedding costs
Add Guardrails for Safety
def safe_query(question: str, max_length: int = 500) -> dict:
"""Validate and sanitize input before RAG processing."""
# Input length check
if len(question) > max_length:
return {"error": "Question too long. Please be more concise."}
# Block prompt injection attempts
injection_patterns = ["ignore previous", "system prompt", "jailbreak"]
if any(p in question.lower() for p in injection_patterns):
return {"error": "Invalid query format."}
return rag_chain_with_sources.invoke(question)
Monitor Latency in Production
import time
from dataclasses import dataclass
@dataclass
class QueryMetrics:
question: str
retrieval_ms: float
generation_ms: float
total_ms: float
chunks_retrieved: int
def query_with_metrics(question: str) -> tuple[dict, QueryMetrics]:
t0 = time.time()
docs = reranking_retriever.invoke(question)
t1 = time.time()
context = format_docs(docs)
answer = (prompt | llm | StrOutputParser()).invoke(
{"context": context, "question": question}
)
t2 = time.time()
metrics = QueryMetrics(
question=question,
retrieval_ms=(t1 - t0) * 1000,
generation_ms=(t2 - t1) * 1000,
total_ms=(t2 - t0) * 1000,
chunks_retrieved=len(docs),
)
return {"answer": answer, "sources": docs}, metrics
result, m = query_with_metrics("What is our SLA for enterprise customers?")
print(f"Total: {m.total_ms:.0f}ms (retrieval={m.retrieval_ms:.0f}ms, generation={m.generation_ms:.0f}ms)")
# Target: retrieval < 200ms, generation < 2000ms, total < 2500ms
Next Steps
- Multi-document RAG: Combine structured data (SQL), unstructured text, and real-time APIs in one knowledge graph
- Conversational RAG: Add
ConversationBufferWindowMemory to maintain session context across turns - Agentic RAG: Let LangGraph decide when to retrieve, when to reason, and when to ask for clarification
- Self-querying retriever: LLM generates metadata filters automatically from natural language
For structured training on these advanced topics:
Frequently Asked Questions
What makes this RAG tutorial different from others?
Most tutorials stop at 'it works'. This guide includes evaluation with RAGAS — so you can measure faithfulness, answer relevance, and context recall with real numbers, not gut feelings. It also covers advanced retrieval strategies (MMR, hybrid search, reranking) that close the gap from prototype to production.
Do I need an OpenAI API key to follow this tutorial?
No. Every code example has a local alternative using Ollama (free, runs on your machine). The OpenAI version is shown first for clarity, but Ollama alternatives are always provided. You need at least 16GB RAM and ~10GB disk for the local models.
What is RAGAS and why should I use it?
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automatically evaluates four key metrics: Faithfulness (does the answer match the retrieved context?), Answer Relevance (does the answer address the question?), Context Precision (is the retrieved context on-topic?), and Context Recall (are all relevant facts retrieved?). Without evaluation, you're flying blind — RAG systems that 'feel good' often score below 70% on faithfulness.
When should I use MMR instead of similarity search?
Use MMR (Maximal Marginal Relevance) when your knowledge base has many similar documents and you get repetitive context in your answers. MMR explicitly penalizes redundancy, retrieving diverse-but-relevant chunks instead of the top-k most similar ones. Typical case: a product FAQ where many questions overlap — MMR retrieves one relevant chunk per topic instead of five paraphrases of the same answer.
How do I handle documents that are updated frequently?
Use ChromaDB's upsert with stable document IDs derived from content hash or file path + modification date. For daily updates: pull new/changed files, upsert only changed chunks (keep IDs stable for unchanged content). Weekly: run a full consistency check comparing your file system to the vector store and delete orphaned chunks. This avoids full reindexing which can take hours on large corpora.