
RAG with LangChain: Complete Implementation and Evaluation Guide (2026)

Most RAG tutorials stop when the first answer appears. This guide goes further: from environment setup and document loading through advanced retrieval strategies (MMR, hybrid search, reranking) to systematic evaluation with RAGAS. Every section includes working Python code and concrete performance numbers so you can measure improvement, not just guess at it.

By Talki Academy · Published April 7, 2026

1. What is RAG and When to Use It

Retrieval-Augmented Generation (RAG) connects an LLM to an external knowledge base at query time. Instead of relying solely on what the model learned during training, RAG retrieves relevant document chunks and injects them into the prompt as context.
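The whole pattern fits in a few lines. In the sketch below, retrieve and generate are hypothetical placeholders — the rest of this guide builds real versions of both with LangChain:

# The RAG pattern, stripped to its core. `retrieve` and `generate` are
# placeholders here; Sections 4-6 build real implementations of each.
def rag_answer(question: str) -> str:
    chunks = retrieve(question)            # nearest-neighbor search over your docs
    context = "\n\n".join(chunks)          # inject retrieved text into the prompt
    return generate(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    )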

Use RAG when:

  • Your documents change frequently (internal wikis, product catalogs, legal contracts)
  • Data is private and was never in any training set (customer records, internal reports)
  • You need source attribution — users must be able to verify answers
  • Fine-tuning is too expensive or too slow to iterate on

Do not use RAG when: the model already knows the domain well (general coding questions, common knowledge), or when you need the LLM to learn a new behavior rather than new facts — that's what fine-tuning is for.

Cost comparison: Fine-tuning GPT-4o costs ~$25/1M training tokens + $5-15/hour compute. Updating a RAG knowledge base costs ~$0.02/1M tokens for re-embedding changed chunks. For documents that change weekly, RAG is 100-500x cheaper to keep current.
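To make that multiplier concrete, here is the back-of-envelope math under one hypothetical workload (1M tokens of changed documents per weekly update, prices as quoted above — adjust for your own corpus and training setup):

# Hypothetical workload: 1M tokens of docs change per weekly update.
rag_cost_per_update = 1.0 * 0.02      # 1M tokens x $0.02/1M  ->  $0.02
ft_cost_per_update = 1.0 * 25 + 10    # 1M training tokens x $25/1M + ~1h compute -> ~$35

print(f"RAG:       ${rag_cost_per_update:.2f} per update")
print(f"Fine-tune: ${ft_cost_per_update:.2f} per update")
print(f"Ratio:     {ft_cost_per_update / rag_cost_per_update:.0f}x")
# With these assumptions the gap is even wider than 100-500x; the exact
# multiple depends on corpus size, training epochs, and compute hours.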

2. Environment Setup

We'll use LangChain 0.3+ and ChromaDB (open-source, runs locally), with both OpenAI and Ollama (free, local inference) supported throughout. Python 3.11+ is required.

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Core LangChain packages
pip install langchain==0.3.0 langchain-community==0.3.0 langchain-openai==0.2.0

# Document loaders
pip install pypdf unstructured[pdf] python-docx

# Vector store
pip install chromadb==0.5.0

# Hybrid search and reranking (needed in Sections 5c-5d)
pip install rank_bm25 sentence-transformers

# Evaluation framework
pip install ragas==0.2.0 datasets

# Optional: local inference (no API costs)
# Install Ollama from https://ollama.ai/, then:
#   ollama pull llama3.3:70b
#   ollama pull nomic-embed-text
pip install langchain-ollama   # LangChain integration for Ollama

Configure API keys — or skip if using Ollama:

# .env file (never commit this)
OPENAI_API_KEY=sk-...

# Load in Python
from dotenv import load_dotenv
load_dotenv()

# Verify
import openai
print(openai.models.list().data[0].id)   # Should print a model name

3. Document Loading and Processing

LangChain provides loaders for 50+ formats. All return a list of Document objects with page_content and metadata, so the rest of your pipeline is format-agnostic.

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredWordDocumentLoader,
    WebBaseLoader,
    DirectoryLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter

# ── Single file loaders ──────────────────────────────────────────────────────
pdf_docs = PyPDFLoader("annual_report.pdf").load()
word_docs = UnstructuredWordDocumentLoader("contract.docx").load()
web_docs = WebBaseLoader("https://docs.example.com/api").load()

# ── Load an entire directory of PDFs ─────────────────────────────────────────
dir_loader = DirectoryLoader(
    "./knowledge_base/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} pages from {len(set(d.metadata['source'] for d in all_docs))} files")

# ── Add custom metadata before splitting ─────────────────────────────────────
for doc in all_docs:
    doc.metadata["department"] = "engineering"
    doc.metadata["indexed_at"] = "2026-04-07"

# ── Split into chunks ────────────────────────────────────────────────────────
# RecursiveCharacterTextSplitter tries paragraph → sentence → word boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # Characters per chunk (not tokens)
    chunk_overlap=200,   # Overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(all_docs)
print(f"Split into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")

# Expected output:
# Loaded 87 pages from 5 files
# Split into 412 chunks
# Average chunk size: 847 chars
Chunk size rules of thumb: Technical docs → 800-1200 chars. Legal text → 1500-2000 chars (clauses need full context). FAQ entries → 300-500 chars (one Q&A pair per chunk). Code files → split by function or class, not character count.
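For the last rule, LangChain ships language-aware separators. A minimal sketch, assuming code_docs holds already-loaded .py files (code_docs is a placeholder, not defined above):

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Language-aware separators prefer class/def boundaries over raw characters
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=0,
)
code_chunks = code_splitter.split_documents(code_docs)  # code_docs: loaded .py files (assumed)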

4. Indexing: Embeddings and Vector Store

Embeddings convert text into numerical vectors. Semantically similar text produces similar vectors, enabling fast nearest-neighbor search. We'll use ChromaDB — open-source, zero infrastructure, runs in-process or as a server.

import os

from langchain_openai import OpenAIEmbeddings
from langchain_ollama import OllamaEmbeddings  # local alternative
from langchain_community.vectorstores import Chroma

# ── Option A: OpenAI embeddings (cost: $0.02 per 1M tokens) ──────────────────
embeddings_openai = OpenAIEmbeddings(model="text-embedding-3-small")

# ── Option B: Local Ollama (cost: $0, requires GPU or fast CPU) ──────────────
embeddings_local = OllamaEmbeddings(model="nomic-embed-text")  # 768 dimensions

# Choose one:
embeddings = embeddings_openai  # or embeddings_local

PERSIST_DIR = "./chroma_db"

if os.path.exists(PERSIST_DIR):
    # Load existing index — no re-embedding needed
    vectorstore = Chroma(
        persist_directory=PERSIST_DIR,
        embedding_function=embeddings,
    )
    print(f"Loaded existing vector store: {vectorstore._collection.count()} chunks")
else:
    # First-time indexing
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=PERSIST_DIR,
        collection_metadata={"hnsw:space": "cosine"},  # cosine similarity
    )
    print(f"Indexed {vectorstore._collection.count()} chunks")

# ── Incremental update (add new docs without reindexing everything) ──────────
# new_doc: any freshly loaded Document, e.g. PyPDFLoader("update.pdf").load()[0]
new_chunks = splitter.split_documents([new_doc])
vectorstore.add_documents(new_chunks)
print("Incremental update complete")
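Quick intuition check before moving on: semantically close sentences really do land close together in vector space. A small sanity test using the embeddings object defined above (the example sentences are illustrative):

import numpy as np

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = embeddings.embed_query("How do I get my money back?")
v2 = embeddings.embed_query("What is the refund procedure?")
v3 = embeddings.embed_query("The API rate limit is 100 requests per minute.")

print(cosine(v1, v2))  # high (same intent, different wording)
print(cosine(v1, v3))  # noticeably lower (unrelated topic)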

5. Retrieval Strategies: Similarity, MMR, and Hybrid

Retrieval is the most impactful variable in RAG quality. The embedding model and chunk size set the ceiling; retrieval strategy determines how close you get to it.

5a. Similarity Search (Baseline)

Returns the k chunks with the highest cosine similarity to the query vector. Fast, simple, sufficient for small homogeneous knowledge bases.

# Basic similarity retriever
retriever_sim = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

results = retriever_sim.invoke("What is our refund policy?")
for doc in results:
    print(f"[{doc.metadata.get('source', '?')}] {doc.page_content[:120]}...")

# Weakness: may return 4 near-identical chunks from the same section

5b. MMR — Maximal Marginal Relevance

MMR balances relevance and diversity. It retrieves candidates by similarity but then iteratively selects the one most relevant to the query and least similar to already-selected chunks. Use this when your corpus has many similar documents.

# MMR retriever
retriever_mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,              # Final number of chunks returned
        "fetch_k": 20,       # Initial candidate pool to select from
        "lambda_mult": 0.6,  # 0 = max diversity, 1 = max relevance (0.5-0.7 is the sweet spot)
    },
)

# MMR example: refund policy query
# Similarity returns: [refund clause, refund clause (paraphrase), refund clause (alt), refund FAQ]
# MMR returns:        [refund clause, returns policy, exchange policy, shipping policy]
# → Much more informative context for the LLM

5c. Hybrid Search (BM25 + Semantic)

Hybrid search combines keyword search (BM25, exact term matching) with semantic search (embedding similarity). It addresses the vocabulary-mismatch problem: semantic search misses exact product codes and section numbers; BM25 misses paraphrases. Together they cover both failure modes.

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# BM25 keyword retriever (no embeddings, pure term frequency)
# Requires the rank_bm25 package installed in Section 2
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Semantic retriever
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Hybrid: weighted combination (0.5 = equal weight)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],  # Give semantic search slightly more weight
)

# Hybrid excels on queries with specific identifiers
query = "What does section 4.2.1 say about data retention?"
# BM25 finds "4.2.1" exactly; semantic finds "data retention policy" context
results = hybrid_retriever.invoke(query)
print(f"Hybrid retrieved {len(results)} unique chunks")

5d. Reranking with a Cross-Encoder

Reranking runs a second, more accurate model to re-score the initial retrieval results. Adds ~200ms latency but typically improves Recall@4 by 8-12 percentage points. Worth it for production systems.

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

# Load reranker model (runs locally, no API key needed; requires the
# sentence-transformers package installed in Section 2).
# First run downloads the model weights from Hugging Face.
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)

# Wrap any retriever with reranking
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever,  # Fetch more candidates, rerank to top 3
)

results = reranking_retriever.invoke("data retention for EU customers")
# Results are now sorted by cross-encoder score, not just embedding similarity

6. Generation Chain with Source Attribution

With a retriever ready, we build the generation chain using LangChain Expression Language (LCEL). The chain is retriever → prompt → LLM → parser, all composable with the | operator.

from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama  # local alternative
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

# ── LLM setup ─────────────────────────────────────────────────────────────────
llm_openai = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_local = ChatOllama(model="llama3.3:70b", temperature=0)
llm = llm_openai  # switch to llm_local for zero API cost

# ── Prompt template ───────────────────────────────────────────────────────────
SYSTEM = """You are a precise assistant that answers questions based strictly on
the provided context.

Rules:
- Only use information from the context below.
- If the context does not contain the answer, say "This information is not in
  the provided documents" — do not hallucinate.
- Cite the source document when referencing specific facts.

Context:
{context}"""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM),
    ("human", "{question}"),
])

def format_docs(docs: list) -> str:
    """Format retrieved documents with source labels."""
    parts = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        parts.append(f"[Source {i}: {source}]\n{doc.page_content}")
    return "\n\n---\n\n".join(parts)

# ── RAG chain with source attribution ─────────────────────────────────────────
rag_chain_with_sources = RunnableParallel(
    {
        "answer": (
            {"context": reranking_retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | llm
            | StrOutputParser()
        ),
        "sources": reranking_retriever,
    }
)

# ── Query ─────────────────────────────────────────────────────────────────────
result = rag_chain_with_sources.invoke("What is the data retention period for EU users?")
print("Answer:", result["answer"])
print("\nSources:")
for doc in result["sources"]:
    print(f"  - {doc.metadata.get('source')} (page {doc.metadata.get('page', '?')})")

# Expected output:
# Answer: According to [Source 1: privacy_policy.pdf], EU users' data is retained
# for 30 days after account deletion, in compliance with Article 17 GDPR.
#
# Sources:
#   - privacy_policy.pdf (page 4)
#   - data_processing_agreement.pdf (page 12)

7. Evaluation with RAGAS

This section is the core differentiator of this guide. Without evaluation, you can't know if switching from similarity to MMR actually helped, or if a prompt change improved faithfulness. RAGAS gives you four key metrics:

Metric            | What it measures                                       | Target
Faithfulness      | Answer only claims facts present in retrieved context  | > 0.85
Answer Relevance  | Answer actually addresses the question asked           | > 0.80
Context Precision | Retrieved chunks contain relevant information          | > 0.75
Context Recall    | All necessary information was retrieved                | > 0.80

7a. Build an Evaluation Dataset

RAGAS requires a test set of questions, expected answers, and retrieved contexts. Start with 20-30 hand-crafted Q&A pairs covering important topics in your knowledge base.

from datasets import Dataset

# Hand-crafted evaluation set (minimum viable: 20 examples)
eval_data = [
    {
        "question": "What is the refund period for digital products?",
        "ground_truth": "Digital products are eligible for a 14-day refund if not downloaded.",
    },
    {
        "question": "Which data centers store EU customer data?",
        "ground_truth": "EU customer data is stored exclusively in Frankfurt (eu-central-1) and Dublin (eu-west-1) AWS regions.",
    },
    {
        "question": "What is the maximum file size for uploads?",
        "ground_truth": "The maximum upload file size is 500MB per file, with a 5GB total per account per day.",
    },
    # Add 17+ more examples covering your knowledge base...
]

# Retrieve contexts and generate answers for each question.
# Answers are generated from the contexts returned by the retriever under
# test — not a fixed chain — so the retriever comparisons in 7c stay
# apples-to-apples.
def build_eval_dataset(eval_data: list, retriever) -> Dataset:
    rows = []
    for item in eval_data:
        docs = retriever.invoke(item["question"])
        answer = (prompt | llm | StrOutputParser()).invoke(
            {"context": format_docs(docs), "question": item["question"]}
        )
        rows.append({
            "question": item["question"],
            "ground_truth": item["ground_truth"],
            "contexts": [doc.page_content for doc in docs],
            "answer": answer,
        })
    return Dataset.from_list(rows)

eval_dataset = build_eval_dataset(eval_data, reranking_retriever)
print(f"Evaluation dataset: {len(eval_dataset)} examples")

7b. Run RAGAS Evaluation

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Evaluate all four metrics
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=llm,                       # Uses your LLM to score answers
    embeddings=embeddings_openai,  # Uses embeddings for relevance scoring
)
print(results)

# Expected output (good RAG system):
# {'faithfulness': 0.89, 'answer_relevancy': 0.84,
#  'context_precision': 0.78, 'context_recall': 0.82}

# Export detailed per-question results
df = results.to_pandas()
df.to_csv("rag_evaluation_results.csv", index=False)

# Find worst-performing questions
worst = df.nsmallest(5, "faithfulness")[["question", "faithfulness", "contexts"]]
print("\nTop 5 low-faithfulness questions:")
print(worst.to_string())

7c. Interpreting Results and Iterating

Symptom                       | Root Cause                                | Fix
Low Faithfulness (<0.75)      | LLM adds information not in context       | Strengthen the system prompt ("Only use provided context"); use temperature=0
Low Context Precision (<0.60) | Retrieval brings in off-topic chunks      | Reduce chunk size, add metadata filters, switch to MMR or hybrid
Low Context Recall (<0.70)    | Relevant info not retrieved at all        | Increase k, check chunk overlap, try reranking
Low Answer Relevance (<0.70)  | Answer drifts from question; poor prompt  | Add an explicit instruction to stay on topic; improve the prompt template
# A/B test retrieval strategies using RAGAS
def benchmark_retriever(retriever, label: str):
    dataset = build_eval_dataset(eval_data, retriever)
    scores = evaluate(
        dataset,
        metrics=[faithfulness, context_precision, context_recall],
        llm=llm,
        embeddings=embeddings_openai,
    )
    print(f"\n{label}:")
    print(f"  Faithfulness:      {scores['faithfulness']:.3f}")
    print(f"  Context Precision: {scores['context_precision']:.3f}")
    print(f"  Context Recall:    {scores['context_recall']:.3f}")
    return scores

benchmark_retriever(retriever_sim, "Similarity (baseline)")
benchmark_retriever(retriever_mmr, "MMR")
benchmark_retriever(hybrid_retriever, "Hybrid BM25+Semantic")
benchmark_retriever(reranking_retriever, "Hybrid + Reranking")

# Sample real-world results on a 5000-chunk knowledge base:
# Similarity (baseline):  Faithfulness=0.79  Precision=0.64  Recall=0.71
# MMR:                    Faithfulness=0.81  Precision=0.70  Recall=0.73
# Hybrid BM25+Semantic:   Faithfulness=0.84  Precision=0.76  Recall=0.79
# Hybrid + Reranking:     Faithfulness=0.89  Precision=0.81  Recall=0.84
# → Reranking adds +10pp faithfulness vs. baseline at ~200ms latency cost

8. Production Best Practices

Cache Embeddings to Cut Costs

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

store = LocalFileStore("./embedding_cache/")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=embeddings_openai,
    document_embedding_cache=store,
    namespace="text-embedding-3-small",  # Avoids collisions if you switch models
)
# Re-embedding unchanged documents hits the disk cache, not OpenAI —
# can save ~60% of embedding costs on repeated indexing runs

Add Guardrails for Safety

def safe_query(question: str, max_length: int = 500) -> dict:
    """Validate and sanitize input before RAG processing."""
    # Input length check
    if len(question) > max_length:
        return {"error": "Question too long. Please be more concise."}

    # Block obvious prompt-injection attempts
    injection_patterns = ["ignore previous", "system prompt", "jailbreak"]
    if any(p in question.lower() for p in injection_patterns):
        return {"error": "Invalid query format."}

    return rag_chain_with_sources.invoke(question)

Monitor Latency in Production

import time
from dataclasses import dataclass

@dataclass
class QueryMetrics:
    question: str
    retrieval_ms: float
    generation_ms: float
    total_ms: float
    chunks_retrieved: int

def query_with_metrics(question: str) -> tuple[dict, QueryMetrics]:
    t0 = time.time()
    docs = reranking_retriever.invoke(question)
    t1 = time.time()

    context = format_docs(docs)
    answer = (prompt | llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )
    t2 = time.time()

    metrics = QueryMetrics(
        question=question,
        retrieval_ms=(t1 - t0) * 1000,
        generation_ms=(t2 - t1) * 1000,
        total_ms=(t2 - t0) * 1000,
        chunks_retrieved=len(docs),
    )
    return {"answer": answer, "sources": docs}, metrics

result, m = query_with_metrics("What is our SLA for enterprise customers?")
print(f"Total: {m.total_ms:.0f}ms (retrieval={m.retrieval_ms:.0f}ms, generation={m.generation_ms:.0f}ms)")
# Target: retrieval < 200ms, generation < 2000ms, total < 2500ms

Next Steps

  • Multi-document RAG: Combine structured data (SQL), unstructured text, and real-time APIs in one knowledge graph
  • Conversational RAG: Add ConversationBufferWindowMemory to maintain session context across turns
  • Agentic RAG: Let LangGraph decide when to retrieve, when to reason, and when to ask for clarification
  • Self-querying retriever: LLM generates metadata filters automatically from natural language


Frequently Asked Questions

What makes this RAG tutorial different from others?

Most tutorials stop at 'it works'. This guide includes evaluation with RAGAS — so you can measure faithfulness, answer relevance, and context recall with real numbers, not gut feelings. It also covers advanced retrieval strategies (MMR, hybrid search, reranking) that close the gap from prototype to production.

Do I need an OpenAI API key to follow this tutorial?

No. Every code example has a local alternative using Ollama (free, runs on your machine). The OpenAI version is shown first for clarity, but Ollama alternatives are always provided. Mind the hardware requirements, though: nomic-embed-text runs comfortably on 16GB of RAM, but llama3.3:70b needs roughly 40GB+ of RAM and disk; on a 16GB machine, pull a smaller chat model (for example, llama3.1:8b at ~5GB) instead.

What is RAGAS and why should I use it?

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automatically evaluates four key metrics: Faithfulness (does the answer match the retrieved context?), Answer Relevance (does the answer address the question?), Context Precision (is the retrieved context on-topic?), and Context Recall (are all relevant facts retrieved?). Without evaluation, you're flying blind — RAG systems that 'feel good' often score below 70% on faithfulness.

When should I use MMR instead of similarity search?

Use MMR (Maximal Marginal Relevance) when your knowledge base has many similar documents and you get repetitive context in your answers. MMR explicitly penalizes redundancy, retrieving diverse-but-relevant chunks instead of the top-k most similar ones. Typical case: a product FAQ where many questions overlap — MMR retrieves one relevant chunk per topic instead of five paraphrases of the same answer.

How do I handle documents that are updated frequently?

Use ChromaDB's upsert with stable document IDs derived from content hash or file path + modification date. For daily updates: pull new/changed files, upsert only changed chunks (keep IDs stable for unchanged content). Weekly: run a full consistency check comparing your file system to the vector store and delete orphaned chunks. This avoids full reindexing which can take hours on large corpora.
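A minimal sketch of the stable-ID pattern, reusing the chunks and vectorstore objects from Sections 3-4. The ID scheme here (source path plus content hash) is one reasonable choice, not the only one:

import hashlib

def chunk_id(chunk) -> str:
    """Stable ID: source path + content hash, so unchanged content keeps its ID."""
    digest = hashlib.sha256(chunk.page_content.encode("utf-8")).hexdigest()[:16]
    return f"{chunk.metadata.get('source', 'unknown')}::{digest}"

ids = [chunk_id(c) for c in chunks]
# Chroma upserts on matching IDs: same ID overwrites, new ID inserts
vectorstore.add_documents(chunks, ids=ids)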
