RAG with LangChain: Complete Implementation and Evaluation Guide (2026)

Q: What makes this RAG tutorial different from others?

Most tutorials stop at 'it works'. This guide includes evaluation with RAGAS — so you can measure faithfulness, answer relevance, and context recall with real numbers, not gut feelings. It also covers advanced retrieval strategies (MMR, hybrid search, reranking) that close the gap from prototype to production.

Q: Do I need an OpenAI API key to follow this tutorial?

No. Every code example has a local alternative using Ollama (free, runs on your machine). The OpenAI version is shown first for clarity, but Ollama alternatives are always provided. You need at least 16GB RAM and ~10GB disk for the local models.

Q: What is RAGAS and why should I use it?

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automatically evaluates four key metrics: Faithfulness (does the answer match the retrieved context?), Answer Relevance (does the answer address the question?), Context Precision (is the retrieved context on-topic?), and Context Recall (are all relevant facts retrieved?). Without evaluation, you're flying blind — RAG systems that 'feel good' often score below 70% on faithfulness.

Q: When should I use MMR instead of similarity search?

Use MMR (Maximal Marginal Relevance) when your knowledge base has many similar documents and you get repetitive context in your answers. MMR explicitly penalizes redundancy, retrieving diverse-but-relevant chunks instead of the top-k most similar ones. Typical case: a product FAQ where many questions overlap — MMR retrieves one relevant chunk per topic instead of five paraphrases of the same answer.

Q: How do I handle documents that are updated frequently?

Use ChromaDB's upsert with stable document IDs derived from content hash or file path + modification date. For daily updates: pull new/changed files, upsert only changed chunks (keep IDs stable for unchanged content). Weekly: run a full consistency check comparing your file system to the vector store and delete orphaned chunks. This avoids full reindexing which can take hours on large corpora.

1. What is RAG and When to Use It

Retrieval-Augmented Generation (RAG) connects an LLM to an external knowledge base at query time. Instead of relying solely on what the model learned during training, RAG retrieves relevant document chunks and injects them into the prompt as context.

Use RAG when:

Your documents change frequently (internal wikis, product catalogs, legal contracts)
Data is private and was never in any training set (customer records, internal reports)
You need source attribution — users must be able to verify answers
Fine-tuning is too expensive or too slow to iterate on

Do not use RAG when: the model already knows the domain well (general coding questions, common knowledge), or when you need the LLM to learn a new behavior rather than new facts — that's what fine-tuning is for.

Cost comparison: Fine-tuning GPT-4o costs ~$25/1M training tokens + $5-15/hour compute. Updating a RAG knowledge base costs ~$0.02/1M tokens for re-embedding changed chunks. For documents that change weekly, RAG is 100-500x cheaper to keep current.

2. Environment Setup

We'll use LangChain 0.3+, ChromaDB (open-source, runs locally), and support both OpenAI and Ollama (free, local inference). Python 3.11+ required.

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Core LangChain packages
pip install langchain==0.3.0 langchain-community==0.3.0 langchain-openai==0.2.0

# Document loaders
pip install pypdf unstructured[pdf] python-docx

# Vector store
pip install chromadb==0.5.0

# Evaluation framework
pip install ragas==0.2.0 datasets

# Optional: local inference (no API costs)
# Install Ollama from https://ollama.ai/, then:
# ollama pull llama3.3:70b
# ollama pull nomic-embed-text

pip install langchain-ollama  # LangChain integration for Ollama

Configure API keys — or skip if using Ollama:

# .env file (never commit this)
OPENAI_API_KEY=sk-...

# Load in Python
from dotenv import load_dotenv
load_dotenv()

# Verify
import openai
print(openai.models.list().data[0].id)  # Should print a model name

3. Document Loading and Processing

LangChain provides loaders for 50+ formats. All return a list of Document objects with page_content and metadata, so the rest of your pipeline is format-agnostic.

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredWordDocumentLoader,
    WebBaseLoader,
    DirectoryLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path

# ── Single file loaders ──────────────────────────────────────────────────────
pdf_docs   = PyPDFLoader("annual_report.pdf").load()
word_docs  = UnstructuredWordDocumentLoader("contract.docx").load()
web_docs   = WebBaseLoader("https://docs.example.com/api").load()

# ── Load an entire directory of PDFs ────────────────────────────────────────
dir_loader = DirectoryLoader(
    "./knowledge_base/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} pages from {len(set(d.metadata['source'] for d in all_docs))} files")

# ── Add custom metadata before splitting ────────────────────────────────────
for doc in all_docs:
    doc.metadata["department"] = "engineering"
    doc.metadata["indexed_at"] = "2026-04-07"

# ── Split into chunks ────────────────────────────────────────────────────────
# RecursiveCharacterTextSplitter tries paragraph → sentence → word boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # Characters per chunk (not tokens)
    chunk_overlap=200,   # Overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(all_docs)

print(f"Split into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")

# Expected output:
# Loaded 87 pages from 5 files
# Split into 412 chunks
# Average chunk size: 847 chars

Chunk size rules of thumb: Technical docs → 800-1200 chars. Legal text → 1500-2000 chars (clauses need full context). FAQ entries → 300-500 chars (one Q&A pair per chunk). Code files → split by function or class, not character count.

4. Indexing: Embeddings and Vector Store

Embeddings convert text into numerical vectors. Semantically similar text produces similar vectors, enabling fast nearest-neighbor search. We'll use ChromaDB — open-source, zero infrastructure, runs in-process or as a server.

from langchain_openai import OpenAIEmbeddings
from langchain_ollama import OllamaEmbeddings  # local alternative
from langchain_community.vectorstores import Chroma
import os

# ── Option A: OpenAI embeddings (cost: $0.02 per 1M tokens) ─────────────────
embeddings_openai = OpenAIEmbeddings(model="text-embedding-3-small")

# ── Option B: Local Ollama (cost: $0, requires GPU or fast CPU) ──────────────
embeddings_local = OllamaEmbeddings(model="nomic-embed-text")  # 768 dimensions

# Choose one:
embeddings = embeddings_openai  # or embeddings_local

PERSIST_DIR = "./chroma_db"

if os.path.exists(PERSIST_DIR):
    # Load existing index — no re-embedding needed
    vectorstore = Chroma(
        persist_directory=PERSIST_DIR,
        embedding_function=embeddings,
    )
    print(f"Loaded existing vector store: {vectorstore._collection.count()} chunks")
else:
    # First-time indexing
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=PERSIST_DIR,
        collection_metadata={"hnsw:space": "cosine"},  # cosine similarity
    )
    print(f"Indexed {vectorstore._collection.count()} chunks")

# ── Incremental update (add new docs without reindexing everything) ──────────
new_chunks = splitter.split_documents([new_doc])
vectorstore.add_documents(new_chunks)
print("Incremental update complete")

5. Retrieval Strategies: Similarity, MMR, and Hybrid

Retrieval is the most impactful variable in RAG quality. The embedding model and chunk size set the ceiling; retrieval strategy determines how close you get to it.

5a. Similarity Search (Baseline)

Returns the k chunks with the highest cosine similarity to the query vector. Fast, simple, sufficient for small homogeneous knowledge bases.

# Basic similarity retriever
retriever_sim = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

results = retriever_sim.invoke("What is our refund policy?")
for doc in results:
    print(f"[{doc.metadata.get('source', '?')}] {doc.page_content[:120]}...")
# Weakness: may return 4 near-identical chunks from the same section

5b. MMR — Maximal Marginal Relevance

MMR balances relevance and diversity. It retrieves candidates by similarity but then iteratively selects the one most relevant to the query and least similar to already-selected chunks. Use this when your corpus has many similar documents.

# MMR retriever
retriever_mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,             # Final number of chunks returned
        "fetch_k": 20,      # Initial candidate pool to select from
        "lambda_mult": 0.6, # 0 = max diversity, 1 = max relevance (0.5-0.7 is sweet spot)
    },
)

# MMR example: refund policy query
# Similarity returns: [refund clause, refund clause (paraphrase), refund clause (alt), refund FAQ]
# MMR returns:        [refund clause, returns policy, exchange policy, shipping policy]
# → Much more informative context for the LLM

5c. Hybrid Search (BM25 + Semantic)

Hybrid combines keyword search (BM25, exact term matching) with semantic search (embedding similarity). It closes the "vocabulary mismatch" problem: semantic search misses exact product codes or names; BM25 misses paraphrases. Together they win on both.

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# BM25 keyword retriever (no embeddings, pure term frequency)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Semantic retriever
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Hybrid: weighted combination (0.5 = equal weight)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],  # Give semantic search slightly more weight
)

# Hybrid excels on queries with specific identifiers
query = "What does section 4.2.1 say about data retention?"
# BM25 finds "4.2.1" exactly; semantic finds "data retention policy" context
results = hybrid_retriever.invoke(query)
print(f"Hybrid retrieved {len(results)} unique chunks")

5d. Reranking with a Cross-Encoder

Reranking runs a second, more accurate model to re-score the initial retrieval results. Adds ~200ms latency but typically improves Recall@4 by 8-12 percentage points. Worth it for production systems.

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

# Load reranker model (runs locally, no API key needed)
# First run downloads ~67MB model
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)

# Wrap any retriever with reranking
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever,  # Fetch more candidates, rerank to top 3
)

results = reranking_retriever.invoke("data retention for EU customers")
# Results are now sorted by cross-encoder score, not just embedding similarity

6. Generation Chain with Source Attribution

With a retriever ready, we build the generation chain using LangChain Expression Language (LCEL). The chain is retriever → prompt → LLM → parser, all composable with the | operator.

from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama          # local alternative
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

# ── LLM setup ────────────────────────────────────────────────────────────────
llm_openai = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_local  = ChatOllama(model="llama3.3:70b", temperature=0)
llm = llm_openai  # switch to llm_local for zero API cost

# ── Prompt template ───────────────────────────────────────────────────────────
SYSTEM = """You are a precise assistant that answers questions based strictly
on the provided context. Rules:
- Only use information from the context below.
- If the context does not contain the answer, say "This information is not in
  the provided documents" — do not hallucinate.
- Cite the source document when referencing specific facts.

Context:
{context}"""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM),
    ("human", "{question}"),
])

def format_docs(docs: list) -> str:
    """Format retrieved documents with source labels."""
    parts = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        parts.append(f"[Source {i}: {source}]\n{doc.page_content}")
    return "\n\n---\n\n".join(parts)

# ── RAG chain with source attribution ────────────────────────────────────────
rag_chain_with_sources = RunnableParallel(
    {
        "answer": (
            {"context": reranking_retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | llm
            | StrOutputParser()
        ),
        "sources": reranking_retriever,
    }
)

# ── Query ─────────────────────────────────────────────────────────────────────
result = rag_chain_with_sources.invoke("What is the data retention period for EU users?")

print("Answer:", result["answer"])
print("\nSources:")
for doc in result["sources"]:
    print(f"  - {doc.metadata.get('source')} (page {doc.metadata.get('page', '?')})")

# Expected output:
# Answer: According to [Source 1: privacy_policy.pdf], EU users' data is retained
# for 30 days after account deletion, in compliance with Article 17 GDPR.
#
# Sources:
#   - privacy_policy.pdf (page 4)
#   - data_processing_agreement.pdf (page 12)

7. Evaluation with RAGAS

This section is the core differentiator of this guide. Without evaluation, you can't know if switching from similarity to MMR actually helped, or if a prompt change improved faithfulness. RAGAS gives you four key metrics:

Metric	What it measures	Target
Faithfulness	Answer only claims facts present in retrieved context	> 0.85
Answer Relevance	Answer actually addresses the question asked	> 0.80
Context Precision	Retrieved chunks contain relevant information	> 0.75
Context Recall	All necessary information was retrieved	> 0.80

7a. Build an Evaluation Dataset

RAGAS requires a test set of questions, expected answers, and retrieved contexts. Start with 20-30 hand-crafted Q&A pairs covering important topics in your knowledge base.

from datasets import Dataset

# Hand-crafted evaluation set (minimum viable: 20 examples)
eval_data = [
    {
        "question": "What is the refund period for digital products?",
        "ground_truth": "Digital products are eligible for a 14-day refund if not downloaded.",
    },
    {
        "question": "Which data centers store EU customer data?",
        "ground_truth": "EU customer data is stored exclusively in Frankfurt (eu-central-1) and Dublin (eu-west-1) AWS regions.",
    },
    {
        "question": "What is the maximum file size for uploads?",
        "ground_truth": "The maximum upload file size is 500MB per file, with a 5GB total per account per day.",
    },
    # Add 17+ more examples covering your knowledge base...
]

# Retrieve contexts for each question
def build_eval_dataset(eval_data: list, retriever) -> Dataset:
    rows = []
    for item in eval_data:
        docs = retriever.invoke(item["question"])
        rows.append({
            "question": item["question"],
            "ground_truth": item["ground_truth"],
            "contexts": [doc.page_content for doc in docs],
            "answer": rag_chain_with_sources.invoke(item["question"])["answer"],
        })
    return Dataset.from_list(rows)

eval_dataset = build_eval_dataset(eval_data, reranking_retriever)
print(f"Evaluation dataset: {len(eval_dataset)} examples")

7b. Run RAGAS Evaluation

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Evaluate all four metrics
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=llm,                      # Uses your LLM to score answers
    embeddings=embeddings_openai,  # Uses embeddings for relevance scoring
)

print(results)
# Expected output (good RAG system):
# {'faithfulness': 0.89, 'answer_relevancy': 0.84,
#  'context_precision': 0.78, 'context_recall': 0.82}

# Export detailed per-question results
df = results.to_pandas()
df.to_csv("rag_evaluation_results.csv", index=False)

# Find worst-performing questions
worst = df.nsmallest(5, "faithfulness")[["question", "faithfulness", "contexts"]]
print("\nTop 5 low-faithfulness questions:")
print(worst.to_string())

7c. Interpreting Results and Iterating

Symptom	Root Cause	Fix
Low Faithfulness (<0.75)	LLM adds information not in context	Strengthen system prompt: "Only use provided context". Use temperature=0.
Low Context Precision (<0.60)	Retrieval brings in off-topic chunks	Reduce chunk size, add metadata filters, switch to MMR or hybrid
Low Context Recall (<0.70)	Relevant info not retrieved at all	Increase k, check chunk overlap, try reranking
Low Answer Relevance (<0.70)	Answer drifts from question; poor prompt	Add explicit instruction to stay on topic, improve prompt template

# A/B test retrieval strategies using RAGAS
def benchmark_retriever(retriever, label: str):
    dataset = build_eval_dataset(eval_data, retriever)
    scores = evaluate(dataset, metrics=[faithfulness, context_precision, context_recall], llm=llm, embeddings=embeddings_openai)
    print(f"\n{label}:")
    print(f"  Faithfulness:      {scores['faithfulness']:.3f}")
    print(f"  Context Precision: {scores['context_precision']:.3f}")
    print(f"  Context Recall:    {scores['context_recall']:.3f}")
    return scores

benchmark_retriever(retriever_sim, "Similarity (baseline)")
benchmark_retriever(retriever_mmr, "MMR")
benchmark_retriever(hybrid_retriever, "Hybrid BM25+Semantic")
benchmark_retriever(reranking_retriever, "Hybrid + Reranking")

# Sample real-world results on a 5000-chunk knowledge base:
# Similarity (baseline):   Faithfulness=0.79  Precision=0.64  Recall=0.71
# MMR:                     Faithfulness=0.81  Precision=0.70  Recall=0.73
# Hybrid BM25+Semantic:    Faithfulness=0.84  Precision=0.76  Recall=0.79
# Hybrid + Reranking:      Faithfulness=0.89  Precision=0.81  Recall=0.84
# → Reranking adds +10pp faithfulness vs. baseline at ~200ms latency cost

8. Production Best Practices

Cache Embeddings to Cut Costs

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

store = LocalFileStore("./embedding_cache/")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=embeddings_openai,
    document_embedding_cache=store,
    namespace="text-embedding-3-small",
)
# Repeated queries hit disk cache, not OpenAI — saves ~60% of embedding costs

Add Guardrails for Safety

def safe_query(question: str, max_length: int = 500) -> dict:
    """Validate and sanitize input before RAG processing."""
    # Input length check
    if len(question) > max_length:
        return {"error": "Question too long. Please be more concise."}

    # Block prompt injection attempts
    injection_patterns = ["ignore previous", "system prompt", "jailbreak"]
    if any(p in question.lower() for p in injection_patterns):
        return {"error": "Invalid query format."}

    return rag_chain_with_sources.invoke(question)

Monitor Latency in Production

import time
from dataclasses import dataclass

@dataclass
class QueryMetrics:
    question: str
    retrieval_ms: float
    generation_ms: float
    total_ms: float
    chunks_retrieved: int

def query_with_metrics(question: str) -> tuple[dict, QueryMetrics]:
    t0 = time.time()
    docs = reranking_retriever.invoke(question)
    t1 = time.time()

    context = format_docs(docs)
    answer = (prompt | llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )
    t2 = time.time()

    metrics = QueryMetrics(
        question=question,
        retrieval_ms=(t1 - t0) * 1000,
        generation_ms=(t2 - t1) * 1000,
        total_ms=(t2 - t0) * 1000,
        chunks_retrieved=len(docs),
    )
    return {"answer": answer, "sources": docs}, metrics

result, m = query_with_metrics("What is our SLA for enterprise customers?")
print(f"Total: {m.total_ms:.0f}ms (retrieval={m.retrieval_ms:.0f}ms, generation={m.generation_ms:.0f}ms)")
# Target: retrieval < 200ms, generation < 2000ms, total < 2500ms

Next Steps

Multi-document RAG: Combine structured data (SQL), unstructured text, and real-time APIs in one knowledge graph
Conversational RAG: Add ConversationBufferWindowMemory to maintain session context across turns
Agentic RAG: Let LangGraph decide when to retrieve, when to reason, and when to ask for clarification
Self-querying retriever: LLM generates metadata filters automatically from natural language

For structured training on these advanced topics:

RAG and Agents in Production (3-day intensive) — Advanced RAG, LangGraph multi-agent systems, production deployment on AWS Lambda
Claude API for Developers (2 days) — Build RAG systems with Claude's 200K-token context window for large document sets

Frequently Asked Questions

What makes this RAG tutorial different from others?