What is RAG and Why Does It Matter?
Retrieval-Augmented Generation (RAG) solves a fundamental limitation of Large Language Models: they don't know about information that didn't exist in their training data. If you ask ChatGPT about your company's internal documentation, a recent research paper, or last quarter's sales data, it can't answer because it was never trained on that information.
RAG bridges this gap by retrieving relevant documents from your knowledge base and including them in the prompt sent to the LLM. The LLM then generates an answer based on the provided context. This approach offers several key advantages:
- No fine-tuning required: Update your knowledge base anytime without retraining models
- Source attribution: Know exactly which documents were used to generate each answer
- Cost-effective: Cheaper than fine-tuning and maintaining custom models
- Privacy-friendly: Can run entirely on-premises with local models
- Always current: Answers reflect your latest documents, not stale training data
RAG Architecture Overview
A RAG system consists of two main phases: indexing (preprocessing) and retrieval (query time).
Phase 1: Indexing (One-Time Setup)
- Load documents: Read PDFs, Word files, web pages, etc.
- Split into chunks: Break documents into smaller pieces (500-1000 characters)
- Generate embeddings: Convert text chunks into numerical vectors
- Store in vector database: Index embeddings for fast similarity search
Phase 2: Retrieval (Every Query)
- Embed the query: Convert user question to vector
- Search vector database: Find most similar document chunks
- Format prompt: Combine query + retrieved chunks
- Generate answer: LLM responds based on provided context
- Return with sources: Include references to source documents
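The two phases above can be sketched end-to-end in plain Python. This is a toy illustration only: it substitutes bag-of-words counts for real embeddings and a plain list for the vector database, but the flow (index chunks, embed the query, rank by similarity) is exactly what the LangChain code in this guide automates:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector.
    # A real system would call an embedding model here.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Phase 1: index the chunks (a list stands in for the vector database)
chunks = [
    "Full-time employees accrue 15 days of paid vacation per year.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Remote work requires written manager approval.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Phase 2: embed the query and rank chunks by similarity
query_vec = embed("How many vacation days do employees get?")
ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
print(ranked[0][0])  # the vacation-policy chunk ranks first
```

The production pipeline swaps embed for an embedding model and the list for ChromaDB, but retrieval is still nearest-neighbor search over vectors.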
Step-by-Step Implementation
Step 1: Environment Setup
First, install the required dependencies. We'll use ChromaDB as our vector store (free, open-source, runs locally).
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install core dependencies
pip install langchain langchain-openai langchain-community
# Install document loaders and text splitting
pip install pypdf unstructured
# Install vector store
pip install chromadb
# Install OpenAI client
pip install openai
# For local models (optional - requires Ollama installed)
pip install ollama
Set your OpenAI API key (or skip if using local models):
export OPENAI_API_KEY="your-api-key-here"
# Or create a .env file:
# OPENAI_API_KEY=your-api-key-here
# In Python, load with:
from dotenv import load_dotenv
load_dotenv()
Step 2: Load and Process Documents
Let's start by loading a PDF document. LangChain provides specialized loaders for different formats.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages from PDF")
print(f"First page preview: {documents[0].page_content[:200]}...")
# Expected output:
# Loaded 45 pages from PDF
# First page preview: Employee Handbook
#
# Welcome to Acme Corporation
#
# This handbook contains important policies and procedures...
Documents are now loaded, but full pages are too large to use as context. We need to split them into smaller chunks.
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Max characters per chunk
    chunk_overlap=200,     # Overlap between chunks to preserve context
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try to split on paragraphs first
)
# Split documents into chunks
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
print(f"\nExample chunk:")
print(f"Content: {chunks[10].page_content}")
print(f"Metadata: {chunks[10].metadata}")
# Expected output:
# Split into 287 chunks
#
# Example chunk:
# Content: Vacation Policy
#
# Full-time employees accrue 15 days of paid vacation per year...
# Metadata: {'source': 'company_handbook.pdf', 'page': 12}
Step 3: Create Embeddings and Vector Store
Now we convert text chunks into numerical vectors (embeddings) and store them in ChromaDB for fast retrieval.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Initialize embeddings model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"  # Cost: $0.02 per 1M tokens
)
# Create vector store and index documents
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Save to disk for reuse
)
print(f"✅ Indexed {len(chunks)} chunks in vector store")
# For subsequent runs, load existing database:
# vectorstore = Chroma(
#     persist_directory="./chroma_db",
#     embedding_function=embeddings
# )
Using local models instead (no API costs, complete privacy):
# Requires Ollama installed: https://ollama.ai/
# Then run: ollama pull nomic-embed-text
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text"  # 768-dim embeddings, free, runs locally
)
# Rest of the code is identical - LangChain abstracts the implementation
Step 4: Create Retriever and Test Queries
The retriever searches the vector store for chunks most similar to the user's query.
# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",  # Other options: "mmr" (diversity), "similarity_score_threshold"
    search_kwargs={"k": 4}     # Retrieve top 4 most relevant chunks
)
# Test retrieval
query = "What is the vacation policy?"
relevant_docs = retriever.invoke(query)
print(f"Query: {query}")
print(f"\nFound {len(relevant_docs)} relevant chunks:")
for i, doc in enumerate(relevant_docs):
    print(f"\n[{i+1}] Page {doc.metadata.get('page', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}...")
# Expected output:
# Query: What is the vacation policy?
#
# Found 4 relevant chunks:
#
# [1] Page 12
# Content: Vacation Policy
#
# Full-time employees accrue 15 days of paid vacation per year...
Step 5: Build the RAG Chain
Now combine the retriever with an LLM to create a complete RAG system. We'll use LangChain's modern LCEL syntax for clarity.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Initialize LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",  # Cost: $0.15 per 1M input tokens
    temperature=0         # Deterministic answers for factual queries
)
# Create prompt template
template = """You are a helpful assistant answering questions based on provided context.
Use the following pieces of context to answer the question at the end.
If you don't know the answer based on the context, say "I don't have enough information to answer that" - don't make up information.
Context:
{context}
Question: {question}
Answer: Provide a clear, concise answer based only on the context above. If you reference specific policies or procedures, mention where they come from."""
prompt = ChatPromptTemplate.from_template(template)
# Helper function to format retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Create RAG chain using LCEL
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Query the RAG system
question = "How many vacation days do employees get?"
answer = rag_chain.invoke(question)
print(f"Question: {question}")
print(f"\nAnswer: {answer}")
# Expected output:
# Question: How many vacation days do employees get?
#
# Answer: According to the Vacation Policy on page 12, full-time employees
# accrue 15 days of paid vacation per year. Part-time employees accrue vacation
# on a pro-rated basis depending on their scheduled hours.
Step 6: Add Source Attribution
For production systems, you want to return source documents alongside the answer so users can verify information.
from langchain.chains import RetrievalQA
# Alternative: use RetrievalQA for automatic source tracking
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = include all docs in one prompt
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": prompt,
    }
)
# Query with sources
result = qa_chain.invoke({"query": "What is the remote work policy?"})
print(f"Question: {result['query']}")
print(f"\nAnswer: {result['result']}")
print(f"\nSources:")
for i, doc in enumerate(result['source_documents']):
    print(f"  [{i+1}] Page {doc.metadata.get('page', 'N/A')} - {doc.metadata.get('source', 'Unknown')}")
    print(f"      Preview: {doc.page_content[:150]}...")
# Expected output:
# Question: What is the remote work policy?
#
# Answer: Employees may work remotely up to 2 days per week with manager
# approval. Remote work arrangements must be documented in writing and
# reviewed quarterly.
#
# Sources:
# [1] Page 18 - company_handbook.pdf
# Preview: Remote Work Policy
#
# To support work-life balance, Acme permits remote work under
# the following conditions...
Complete Working Example
Here's a complete, runnable RAG implementation you can use as a starting point:
# rag_system.py
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import os
class RAGSystem:
    def __init__(self, pdf_path: str, persist_dir: str = "./chroma_db"):
        """Initialize RAG system with a PDF document."""
        self.pdf_path = pdf_path
        self.persist_dir = persist_dir

        # Initialize components
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

        # Load or create vector store
        if os.path.exists(persist_dir):
            print("Loading existing vector store...")
            self.vectorstore = Chroma(
                persist_directory=persist_dir,
                embedding_function=self.embeddings
            )
        else:
            print("Creating new vector store...")
            self._index_documents()

        # Create retriever
        self.retriever = self.vectorstore.as_retriever(
            search_kwargs={"k": 4}
        )

        # Create RAG chain
        self._create_chain()

    def _index_documents(self):
        """Load, split, and index documents."""
        # Load PDF
        loader = PyPDFLoader(self.pdf_path)
        documents = loader.load()
        print(f"Loaded {len(documents)} pages")

        # Split into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )
        chunks = text_splitter.split_documents(documents)
        print(f"Split into {len(chunks)} chunks")

        # Create and persist vector store
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_dir
        )
        print(f"✅ Indexed {len(chunks)} chunks")

    def _create_chain(self):
        """Create the RAG chain."""
        template = """Answer the question based on the following context.
If you don't know, say "I don't have enough information" - don't make up information.

Context:
{context}

Question: {question}

Answer:"""
        prompt = ChatPromptTemplate.from_template(template)

        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        self.chain = (
            {"context": self.retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

    def query(self, question: str) -> dict:
        """Query the RAG system and return answer with sources."""
        # Get answer
        answer = self.chain.invoke(question)

        # Get source documents
        sources = self.retriever.invoke(question)

        return {
            "question": question,
            "answer": answer,
            "sources": [
                {
                    "page": doc.metadata.get("page", "N/A"),
                    "content": doc.page_content[:200] + "..."
                }
                for doc in sources
            ]
        }
# Usage example
if __name__ == "__main__":
    # Initialize RAG system
    rag = RAGSystem("company_handbook.pdf")

    # Ask questions
    questions = [
        "What is the vacation policy?",
        "How many sick days do employees get?",
        "What is the remote work policy?"
    ]

    for q in questions:
        result = rag.query(q)
        print(f"\nQ: {result['question']}")
        print(f"A: {result['answer']}")
        print(f"Sources: {len(result['sources'])} documents")
        print("-" * 80)
Best Practices for Production
1. Optimize Chunk Size and Overlap
The right chunk size depends on your use case. Test with different values:
- Technical docs: 500-1000 characters, 100-200 overlap
- Legal documents: 1500-2000 characters (preserve clause context)
- Chat logs: Split by message or timestamp
- Code files: Split by function or class definition
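To build intuition for how chunk_size and chunk_overlap interact, here is a deliberately simplified fixed-window splitter. It ignores the separator-aware logic of RecursiveCharacterTextSplitter (which prefers paragraph and sentence boundaries) and shows only the windowing arithmetic:

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each chunk starts chunk_size - chunk_overlap characters after
    # the previous one. Assumes chunk_overlap < chunk_size.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text) - chunk_overlap, step)]

doc = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(doc, chunk_size=1000, chunk_overlap=200)

print(len(chunks))                          # 3 chunks for 2500 characters
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbors share 200 characters
```

The overlap is what preserves context across chunk boundaries: a sentence cut at the end of one window reappears whole at the start of the next.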
2. Implement Caching for Embeddings
Avoid re-computing embeddings for text you have already embedded (this helps most when re-indexing, since CacheBackedEmbeddings caches document embeddings):
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

# Create cache
store = LocalFileStore("./embedding_cache/")

# Wrap embeddings with cache
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=OpenAIEmbeddings(),
    document_embedding_cache=store,
    namespace="openai_embeddings"
)

# Use cached_embeddings in the vectorstore - unchanged documents are
# never re-embedded, which can substantially cut re-indexing costs
3. Add Metadata Filtering
Filter retrieval by document type, date, author, etc:
# When loading documents, add metadata
for doc in documents:
    doc.metadata["department"] = "HR"
    doc.metadata["last_updated"] = "2026-01-15"

# Filter retrieval
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"department": "HR"}  # Only retrieve HR documents
    }
)
)
4. Monitor and Log Performance
import time
from datetime import datetime
def query_with_metrics(rag_system, question):
    start = time.time()
    result = rag_system.query(question)
    elapsed = time.time() - start

    # Log metrics
    print(f"[{datetime.now()}] Query: {question[:50]}...")
    print(f"  Response time: {elapsed:.2f}s")
    print(f"  Sources retrieved: {len(result['sources'])}")
    print(f"  Answer length: {len(result['answer'])} chars")

    return result
Using Local Models for Complete Privacy
For sensitive documents, run everything locally with Ollama:
# Install Ollama: https://ollama.ai/
# Pull models:
# ollama pull llama3.3:70b
# ollama pull nomic-embed-text
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
# Use local models
embeddings = OllamaEmbeddings(model="nomic-embed-text")
llm = Ollama(model="llama3.3:70b", temperature=0)
# Everything else stays the same
# Vector store, retriever, chain logic is identical
# Now runs 100% locally with zero API costs
Common Issues and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Answers lack context | Chunks too small or k too low | Increase chunk_size to 1500 or k to 6-8 |
| Irrelevant info in answers | Chunks too large, poor retrieval | Decrease chunk_size to 500, add metadata filters |
| Slow query responses | No embedding cache, large k | Implement CacheBackedEmbeddings, reduce k to 3-4 |
| High API costs | Using gpt-4, no caching | Switch to gpt-4o-mini, cache embeddings, consider local models |
| "I don't know" for known info | Poor chunking or embeddings | Adjust separators, try different embedding model, check document quality |
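Several fixes in the table are retriever configuration changes. The score-threshold idea in particular (exposed via search_type="similarity_score_threshold", as noted in Step 4) boils down to a simple filter. A minimal sketch with made-up similarity scores:

```python
def filter_by_score(scored_chunks, threshold=0.75, k=4):
    # Drop chunks below the similarity threshold, then keep the top k.
    kept = [pair for pair in scored_chunks if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical (chunk, score) pairs as a retriever might return them
results = [
    ("Vacation Policy: full-time employees accrue...", 0.91),
    ("Cafeteria hours are 8am to 3pm...", 0.42),
    ("PTO accrual is pro-rated for part-time staff...", 0.83),
]
print(filter_by_score(results))  # keeps only the two chunks scoring >= 0.75
```

Raising the threshold trades recall for precision: fewer, more relevant chunks reach the prompt, which directly addresses the "irrelevant info in answers" row above.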
Next Steps
Now that you have a working RAG system, consider these enhancements:
- Advanced retrieval: Implement hybrid search (keyword + semantic), reranking, or parent-document retrieval
- Multiple data sources: Combine PDFs, web pages, databases, APIs into one knowledge base
- Conversation memory: Add ConversationBufferMemory for multi-turn Q&A sessions
- Web interface: Build a Streamlit or FastAPI frontend for your RAG system
- Production deployment: Deploy to AWS Lambda or Docker containers with horizontal scaling
For comprehensive professional training on RAG systems and LLM applications:
- RAG and Agents in Production (3-day intensive): Advanced RAG techniques, LangChain, LlamaIndex, vector database optimization, deployment patterns
- Claude API for Developers (2 days): Master Claude 4.5 for RAG applications, extended context windows (200K tokens), prompt optimization
Frequently Asked Questions
Why use RAG instead of just prompting an LLM directly?
LLMs are trained on data with a cutoff date and don't know about your private documents. RAG retrieves relevant context from your documents and includes it in the prompt, enabling the LLM to answer questions about information it was never trained on. This eliminates the need for expensive fine-tuning and provides source attribution for answers.
Can I use RAG with local/open-source models instead of OpenAI?
Absolutely. LangChain supports Ollama for local inference (use ChatOllama and OllamaEmbeddings). You can run Llama 3.3 70B or Mistral locally with zero API costs. For embeddings, nomic-embed-text via Ollama works excellently. This gives you complete data privacy and eliminates ongoing API costs.
What's the best chunk size for document splitting in RAG?
There's no universal answer - it depends on your use case. For technical documentation: 500-1000 characters with 100-200 overlap. For legal documents: 1500-2000 characters (preserve clause context). For chat transcripts: split by turn or timestamp. The key is testing retrieval quality - if answers lack context, increase chunk size; if irrelevant content appears, decrease it.
How do I handle multiple document formats (PDF, Word, HTML)?
LangChain provides specialized loaders: PyPDFLoader for PDFs, UnstructuredWordDocumentLoader for Word, WebBaseLoader for HTML. For production, use UnstructuredFileLoader which auto-detects format. All loaders return the same Document format, so your RAG pipeline stays consistent regardless of source format.
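The loader-per-format pattern reduces to a dispatch on file extension. The sketch below uses loader class names as plain strings purely for illustration; the names mirror langchain_community's loaders, but verify them against your installed version before wiring up real imports:

```python
from pathlib import Path

# Extension-to-loader mapping (names follow langchain_community's
# conventions; treat this table as an assumption to verify)
LOADERS = {
    ".pdf": "PyPDFLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".html": "WebBaseLoader",
}

def pick_loader(path: str) -> str:
    # Fall back to UnstructuredFileLoader, which auto-detects formats
    return LOADERS.get(Path(path).suffix.lower(), "UnstructuredFileLoader")

print(pick_loader("company_handbook.pdf"))  # PyPDFLoader
print(pick_loader("notes.txt"))             # UnstructuredFileLoader
```

Because every loader returns the same Document objects, the splitting and indexing code downstream never needs to know which loader produced them.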
What's the typical response time and cost for a RAG query?
With OpenAI gpt-4o-mini and text-embedding-3-small: roughly 1-2 seconds per query, at a fraction of a cent per query (embedding the query is negligible; generating the answer over ~4 retrieved chunks dominates the cost). With local Llama via Ollama: 2-5 seconds depending on your GPU, at zero API cost. For production, add caching (e.g. Redis for answers, CacheBackedEmbeddings for documents) to cut costs further on repeated queries.
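The per-query cost is easy to sanity-check with token arithmetic. The prices below are assumptions based on published rates at the time of writing; check your provider's current pricing page before relying on them:

```python
# Assumed prices ($ per 1M tokens) - verify against current pricing
EMBED_PRICE = 0.02    # text-embedding-3-small
INPUT_PRICE = 0.15    # gpt-4o-mini input
OUTPUT_PRICE = 0.60   # gpt-4o-mini output

def query_cost(query_tokens: int, context_tokens: int, answer_tokens: int) -> float:
    # Embedding cost covers only the query; generation cost covers the
    # full prompt (query + retrieved context) plus the answer.
    embedding = query_tokens / 1e6 * EMBED_PRICE
    generation = ((query_tokens + context_tokens) / 1e6 * INPUT_PRICE
                  + answer_tokens / 1e6 * OUTPUT_PRICE)
    return embedding + generation

# Short query, 4 chunks of ~250 tokens each, ~150-token answer
cost = query_cost(query_tokens=20, context_tokens=1000, answer_tokens=150)
print(f"${cost:.5f} per query")  # embedding cost is negligible next to generation
```

Actual numbers scale with your chunk sizes and k; the structure of the formula is the useful part, since it shows that retrieved context, not the query embedding, drives the bill.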