What is RAG and Why Does It Matter?
Retrieval-Augmented Generation (RAG) solves a fundamental limitation of Large Language Models: they don't know about information that didn't exist in their training data. If you ask ChatGPT about your company's internal documentation, a recent research paper, or last quarter's sales data, it can't answer because it was never trained on that information.
RAG bridges this gap by retrieving relevant documents from your knowledge base and including them in the prompt sent to the LLM. The LLM then generates an answer based on the provided context. This approach offers several key advantages:
- No fine-tuning required: Update your knowledge base anytime without retraining models
- Source attribution: Know exactly which documents were used to generate each answer
- Cost-effective: Cheaper than fine-tuning and maintaining custom models
- Privacy-friendly: Can run entirely on-premises with local models
- Always current: Answers reflect your latest documents, not stale training data
RAG Architecture Overview
A RAG system consists of two main phases: indexing (preprocessing) and retrieval (query time).
Phase 1: Indexing (One-Time Setup)
- Load documents: Read PDFs, Word files, web pages, etc.
- Split into chunks: Break documents into smaller pieces (500-1000 characters)
- Generate embeddings: Convert text chunks into numerical vectors
- Store in vector database: Index embeddings for fast similarity search
Phase 2: Retrieval (Every Query)
- Embed the query: Convert user question to vector
- Search vector database: Find most similar document chunks
- Format prompt: Combine query + retrieved chunks
- Generate answer: LLM responds based on provided context
- Return with sources: Include references to source documents
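The two phases above can be sketched end-to-end in plain Python. This is a toy illustration only: it substitutes bag-of-words counts for real embeddings and a plain list for the vector database, but the flow (index chunks, embed the query, rank by similarity) is exactly what the LangChain code in this guide automates:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector.
    # A real system would call an embedding model here.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Phase 1: index the chunks (a list stands in for the vector database)
chunks = [
    "Full-time employees accrue 15 days of paid vacation per year.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Remote work requires written manager approval.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Phase 2: embed the query and rank chunks by similarity
query_vec = embed("How many vacation days do employees get?")
ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
print(ranked[0][0])  # the vacation-policy chunk ranks first
```

The production pipeline swaps embed for an embedding model and the list for ChromaDB, but retrieval is still nearest-neighbor search over vectors.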
Step-by-Step Implementation
Step 1: Environment Setup
First, install the required dependencies. We'll use ChromaDB as our vector store (free, open-source, runs locally).
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install core dependencies
pip install langchain langchain-openai langchain-community
# Install document loaders and text splitting
pip install pypdf unstructured
# Install vector store
pip install chromadb
# Install OpenAI client
pip install openai
# For local models (optional - requires Ollama installed)
pip install ollama
Set your OpenAI API key (or skip if using local models):
export OPENAI_API_KEY="your-api-key-here"
# Or create a .env file:
# OPENAI_API_KEY=your-api-key-here
# In Python, load with:
from dotenv import load_dotenv
load_dotenv()
Step 2: Load and Process Documents
Let's start by loading a PDF document. LangChain provides specialized loaders for different formats.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages from PDF")
print(f"First page preview: {documents[0].page_content[:200]}...")
# Expected output:
# Loaded 45 pages from PDF
# First page preview: Employee Handbook
#
# Welcome to Acme Corporation
#
# This handbook contains important policies and procedures...
Documents are now loaded, but full pages are too large to use as context. We need to split them into smaller chunks.
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Max characters per chunk
    chunk_overlap=200,     # Overlap between chunks to preserve context
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try to split on paragraphs first
)
# Split documents into chunks
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
print(f"\nExample chunk:")
print(f"Content: {chunks[10].page_content}")
print(f"Metadata: {chunks[10].metadata}")
# Expected output:
# Split into 287 chunks
#
# Example chunk:
# Content: Vacation Policy
#
# Full-time employees accrue 15 days of paid vacation per year...
# Metadata: {'source': 'company_handbook.pdf', 'page': 12}
Step 3: Create Embeddings and Vector Store
Now we convert text chunks into numerical vectors (embeddings) and store them in ChromaDB for fast retrieval.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Initialize embeddings model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"  # Cost: $0.02 per 1M tokens
)
# Create vector store and index documents
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Save to disk for reuse
)
print(f"✅ Indexed {len(chunks)} chunks in vector store")
# For subsequent runs, load existing database:
# vectorstore = Chroma(
#     persist_directory="./chroma_db",
#     embedding_function=embeddings
# )
Using local models instead (no API costs, complete privacy):
# Requires Ollama installed: https://ollama.ai/
# Then run: ollama pull nomic-embed-text
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text"  # 768-dim embeddings, free, runs locally
)
# Rest of the code is identical - LangChain abstracts the implementation
Step 4: Create Retriever and Test Queries
The retriever searches the vector store for chunks most similar to the user's query.
# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",  # Other options: "mmr" (diversity), "similarity_score_threshold"
    search_kwargs={"k": 4}     # Retrieve top 4 most relevant chunks
)
# Test retrieval
query = "What is the vacation policy?"
relevant_docs = retriever.invoke(query)
print(f"Query: {query}")
print(f"\nFound {len(relevant_docs)} relevant chunks:")
for i, doc in enumerate(relevant_docs):
    print(f"\n[{i+1}] Page {doc.metadata.get('page', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}...")
# Expected output:
# Query: What is the vacation policy?
#
# Found 4 relevant chunks:
#
# [1] Page 12
# Content: Vacation Policy
#
# Full-time employees accrue 15 days of paid vacation per year...
Step 5: Build the RAG Chain
Now combine the retriever with an LLM to create a complete RAG system. We'll use LangChain's modern LCEL syntax for clarity.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Initialize LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",  # Cost: $0.15 per 1M input tokens
    temperature=0         # Deterministic answers for factual queries
)
# Create prompt template
template = """You are a helpful assistant answering questions based on provided context.
Use the following pieces of context to answer the question at the end.
If you don't know the answer based on the context, say "I don't have enough information to answer that" - don't make up information.
Context:
{context}
Question: {question}
Answer: Provide a clear, concise answer based only on the context above. If you reference specific policies or procedures, mention where they come from."""
prompt = ChatPromptTemplate.from_template(template)
# Helper function to format retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Create RAG chain using LCEL
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Query the RAG system
question = "How many vacation days do employees get?"
answer = rag_chain.invoke(question)
print(f"Question: {question}")
print(f"\nAnswer: {answer}")
# Expected output:
# Question: How many vacation days do employees get?
#
# Answer: According to the Vacation Policy on page 12, full-time employees
# accrue 15 days of paid vacation per year. Part-time employees accrue vacation
# on a pro-rated basis depending on their scheduled hours.
Step 6: Add Source Attribution
For production systems, you want to return source documents alongside the answer so users can verify information.
from langchain.chains import RetrievalQA
# Alternative: use RetrievalQA for automatic source tracking
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = include all docs in one prompt
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": prompt,
    }
)
# Query with sources
result = qa_chain.invoke({"query": "What is the remote work policy?"})
print(f"Question: {result['query']}")
print(f"\nAnswer: {result['result']}")
print(f"\nSources:")
for i, doc in enumerate(result['source_documents']):
    print(f"  [{i+1}] Page {doc.metadata.get('page', 'N/A')} - {doc.metadata.get('source', 'Unknown')}")
    print(f"      Preview: {doc.page_content[:150]}...")
# Expected output:
# Question: What is the remote work policy?
#
# Answer: Employees may work remotely up to 2 days per week with manager
# approval. Remote work arrangements must be documented in writing and
# reviewed quarterly.
#
# Sources:
# [1] Page 18 - company_handbook.pdf
# Preview: Remote Work Policy
#
# To support work-life balance, Acme permits remote work under
# the following conditions...
Complete Working Example
Here's a complete, runnable RAG implementation you can use as a starting point:
# rag_system.py
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import os
class RAGSystem:
    def __init__(self, pdf_path: str, persist_dir: str = "./chroma_db"):
        """Initialize RAG system with a PDF document."""
        self.pdf_path = pdf_path
        self.persist_dir = persist_dir

        # Initialize components
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

        # Load or create vector store
        if os.path.exists(persist_dir):
            print("Loading existing vector store...")
            self.vectorstore = Chroma(
                persist_directory=persist_dir,
                embedding_function=self.embeddings
            )
        else:
            print("Creating new vector store...")
            self._index_documents()

        # Create retriever
        self.retriever = self.vectorstore.as_retriever(
            search_kwargs={"k": 4}
        )

        # Create RAG chain
        self._create_chain()

    def _index_documents(self):
        """Load, split, and index documents."""
        # Load PDF
        loader = PyPDFLoader(self.pdf_path)
        documents = loader.load()
        print(f"Loaded {len(documents)} pages")

        # Split into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )
        chunks = text_splitter.split_documents(documents)
        print(f"Split into {len(chunks)} chunks")

        # Create and persist vector store
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_dir
        )
        print(f"✅ Indexed {len(chunks)} chunks")

    def _create_chain(self):
        """Create the RAG chain."""
        template = """Answer the question based on the following context.
If you don't know, say "I don't have enough information" - don't make up information.

Context:
{context}

Question: {question}

Answer:"""
        prompt = ChatPromptTemplate.from_template(template)

        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        self.chain = (
            {"context": self.retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

    def query(self, question: str) -> dict:
        """Query the RAG system and return answer with sources."""
        # Get answer
        answer = self.chain.invoke(question)

        # Get source documents
        sources = self.retriever.invoke(question)

        return {
            "question": question,
            "answer": answer,
            "sources": [
                {
                    "page": doc.metadata.get("page", "N/A"),
                    "content": doc.page_content[:200] + "..."
                }
                for doc in sources
            ]
        }
# Usage example
if __name__ == "__main__":
    # Initialize RAG system
    rag = RAGSystem("company_handbook.pdf")

    # Ask questions
    questions = [
        "What is the vacation policy?",
        "How many sick days do employees get?",
        "What is the remote work policy?"
    ]

    for q in questions:
        result = rag.query(q)
        print(f"\nQ: {result['question']}")
        print(f"A: {result['answer']}")
        print(f"Sources: {len(result['sources'])} documents")
        print("-" * 80)
Best Practices for Production
1. Optimize Chunk Size and Overlap
The right chunk size depends on your use case. Test with different values:
- Technical docs: 500-1000 characters, 100-200 overlap
- Legal documents: 1500-2000 characters (preserve clause context)
- Chat logs: Split by message or timestamp
- Code files: Split by function or class definition
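To build intuition for how chunk_size and chunk_overlap interact, here is a deliberately simplified fixed-window splitter. It ignores the separator-aware logic of RecursiveCharacterTextSplitter (which prefers paragraph and sentence boundaries) and shows only the windowing arithmetic:

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each chunk starts chunk_size - chunk_overlap characters after
    # the previous one. Assumes chunk_overlap < chunk_size.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text) - chunk_overlap, step)]

doc = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(doc, chunk_size=1000, chunk_overlap=200)

print(len(chunks))                          # 3 chunks for 2500 characters
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbors share 200 characters
```

The overlap is what preserves context across chunk boundaries: a sentence cut at the end of one window reappears whole at the start of the next.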
2. Implement Caching for Embeddings
Avoid re-computing embeddings for text you have already embedded (this helps most when re-indexing, since CacheBackedEmbeddings caches document embeddings):
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

# Create cache
store = LocalFileStore("./embedding_cache/")

# Wrap embeddings with cache
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=OpenAIEmbeddings(),
    document_embedding_cache=store,
    namespace="openai_embeddings"
)

# Use cached_embeddings in the vectorstore - unchanged documents are
# never re-embedded, which can substantially cut re-indexing costs
3. Add Metadata Filtering
Filter retrieval by document type, date, author, etc:
# When loading documents, add metadata
for doc in documents:
    doc.metadata["department"] = "HR"
    doc.metadata["last_updated"] = "2026-01-15"

# Filter retrieval
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"department": "HR"}  # Only retrieve HR documents
    }
)
)
4. Monitor and Log Performance
import time
from datetime import datetime
def query_with_metrics(rag_system, question):
    start = time.time()
    result = rag_system.query(question)
    elapsed = time.time() - start

    # Log metrics
    print(f"[{datetime.now()}] Query: {question[:50]}...")
    print(f"  Response time: {elapsed:.2f}s")
    print(f"  Sources retrieved: {len(result['sources'])}")
    print(f"  Answer length: {len(result['answer'])} chars")

    return result
Using Local Models for Complete Privacy
For sensitive documents, run everything locally with Ollama:
# Install Ollama: https://ollama.ai/
# Pull models:
# ollama pull llama3.3:70b
# ollama pull nomic-embed-text
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
# Use local models
embeddings = OllamaEmbeddings(model="nomic-embed-text")
llm = Ollama(model="llama3.3:70b", temperature=0)
# Everything else stays the same
# Vector store, retriever, chain logic is identical
# Now runs 100% locally with zero API costs
Common Issues and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Answers lack context | Chunks too small or k too low | Increase chunk_size to 1500 or k to 6-8 |
| Irrelevant info in answers | Chunks too large, poor retrieval | Decrease chunk_size to 500, add metadata filters |
| Slow query responses | No embedding cache, large k | Implement CacheBackedEmbeddings, reduce k to 3-4 |
| High API costs | Using gpt-4, no caching | Switch to gpt-4o-mini, cache embeddings, consider local models |
| "I don't know" for known info | Poor chunking or embeddings | Adjust separators, try different embedding model, check document quality |
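Several fixes in the table are retriever configuration changes. The score-threshold idea in particular (exposed via search_type="similarity_score_threshold", as noted in Step 4) boils down to a simple filter. A minimal sketch with made-up similarity scores:

```python
def filter_by_score(scored_chunks, threshold=0.75, k=4):
    # Drop chunks below the similarity threshold, then keep the top k.
    kept = [pair for pair in scored_chunks if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical (chunk, score) pairs as a retriever might return them
results = [
    ("Vacation Policy: full-time employees accrue...", 0.91),
    ("Cafeteria hours are 8am to 3pm...", 0.42),
    ("PTO accrual is pro-rated for part-time staff...", 0.83),
]
print(filter_by_score(results))  # keeps only the two chunks scoring >= 0.75
```

Raising the threshold trades recall for precision: fewer, more relevant chunks reach the prompt, which directly addresses the "irrelevant info in answers" row above.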
Next Steps
Now that you have a working RAG system, consider these enhancements:
- Advanced retrieval: Implement hybrid search (keyword + semantic), reranking, or parent-document retrieval
- Multiple data sources: Combine PDFs, web pages, databases, APIs into one knowledge base
- Conversation memory: Add ConversationBufferMemory for multi-turn Q&A sessions
- Web interface: Build a Streamlit or FastAPI frontend for your RAG system
- Production deployment: Deploy to AWS Lambda or Docker containers with horizontal scaling
For comprehensive professional training on RAG systems and LLM applications:
- RAG and Agents in Production (3-day intensive): Advanced RAG techniques, LangChain, LlamaIndex, vector database optimization, deployment patterns
- Claude API for Developers (2 days): Master Claude 4.5 for RAG applications, extended context windows (200K tokens), prompt optimization
Frequently Asked Questions
Why use RAG instead of just prompting an LLM directly?
LLMs are trained on data with a cutoff date and don't know about your private documents. RAG retrieves relevant context from your documents and includes it in the prompt, enabling the LLM to answer questions about information it was never trained on. This eliminates the need for expensive fine-tuning and provides source attribution for answers.
Can I use RAG with local/open-source models instead of OpenAI?
Absolutely. LangChain supports Ollama for local inference (use ChatOllama and OllamaEmbeddings). You can run Llama 3.3 70B or Mistral locally with zero API costs. For embeddings, nomic-embed-text via Ollama works excellently. This gives you complete data privacy and eliminates ongoing API costs.
What's the best chunk size for document splitting in RAG?
There's no universal answer - it depends on your use case. For technical documentation: 500-1000 characters with 100-200 overlap. For legal documents: 1500-2000 characters (preserve clause context). For chat transcripts: split by turn or timestamp. The key is testing retrieval quality - if answers lack context, increase chunk size; if irrelevant content appears, decrease it.
How do I handle multiple document formats (PDF, Word, HTML)?
LangChain provides specialized loaders: PyPDFLoader for PDFs, UnstructuredWordDocumentLoader for Word, WebBaseLoader for HTML. For production, use UnstructuredFileLoader which auto-detects format. All loaders return the same Document format, so your RAG pipeline stays consistent regardless of source format.
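The loader-per-format pattern reduces to a dispatch on file extension. The sketch below uses loader class names as plain strings purely for illustration; the names mirror langchain_community's loaders, but verify them against your installed version before wiring up real imports:

```python
from pathlib import Path

# Extension-to-loader mapping (names follow langchain_community's
# conventions; treat this table as an assumption to verify)
LOADERS = {
    ".pdf": "PyPDFLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".html": "WebBaseLoader",
}

def pick_loader(path: str) -> str:
    # Fall back to UnstructuredFileLoader, which auto-detects formats
    return LOADERS.get(Path(path).suffix.lower(), "UnstructuredFileLoader")

print(pick_loader("company_handbook.pdf"))  # PyPDFLoader
print(pick_loader("notes.txt"))             # UnstructuredFileLoader
```

Because every loader returns the same Document objects, the splitting and indexing code downstream never needs to know which loader produced them.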
What's the typical response time and cost for a RAG query?
With OpenAI gpt-4o-mini and text-embedding-3-small: roughly 1-2 seconds per query, at a fraction of a cent per query (embedding the query is negligible; generating the answer over ~4 retrieved chunks dominates the cost). With local Llama via Ollama: 2-5 seconds depending on your GPU, at zero API cost. For production, add caching (e.g. Redis for answers, CacheBackedEmbeddings for documents) to cut costs further on repeated queries.
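The per-query cost is easy to sanity-check with token arithmetic. The prices below are assumptions based on published rates at the time of writing; check your provider's current pricing page before relying on them:

```python
# Assumed prices ($ per 1M tokens) - verify against current pricing
EMBED_PRICE = 0.02    # text-embedding-3-small
INPUT_PRICE = 0.15    # gpt-4o-mini input
OUTPUT_PRICE = 0.60   # gpt-4o-mini output

def query_cost(query_tokens: int, context_tokens: int, answer_tokens: int) -> float:
    # Embedding cost covers only the query; generation cost covers the
    # full prompt (query + retrieved context) plus the answer.
    embedding = query_tokens / 1e6 * EMBED_PRICE
    generation = ((query_tokens + context_tokens) / 1e6 * INPUT_PRICE
                  + answer_tokens / 1e6 * OUTPUT_PRICE)
    return embedding + generation

# Short query, 4 chunks of ~250 tokens each, ~150-token answer
cost = query_cost(query_tokens=20, context_tokens=1000, answer_tokens=150)
print(f"${cost:.5f} per query")  # embedding cost is negligible next to generation
```

Actual numbers scale with your chunk sizes and k; the structure of the formula is the useful part, since it shows that retrieved context, not the query embedding, drives the bill.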