Talki Academy
Technical · 12 min read

Building RAG Systems with LangChain: Practical Tutorial with Working Code

Retrieval-Augmented Generation (RAG) enables LLMs to answer questions about your private documents without expensive fine-tuning. This hands-on tutorial walks through building a production-ready RAG system using LangChain: from loading documents and creating embeddings, to querying with source attribution. Includes complete working code, best practices for chunking and retrieval, and guidance on using both cloud APIs and local open-source models.

By Talki Academy · Published April 5, 2026

What is RAG and Why Does It Matter?

Retrieval-Augmented Generation (RAG) solves a fundamental limitation of Large Language Models: they don't know about information that didn't exist in their training data. If you ask ChatGPT about your company's internal documentation, a recent research paper, or last quarter's sales data, it can't answer because it was never trained on that information.

RAG bridges this gap by retrieving relevant documents from your knowledge base and including them in the prompt sent to the LLM. The LLM then generates an answer based on the provided context. This approach offers several key advantages:

  • No fine-tuning required: Update your knowledge base anytime without retraining models
  • Source attribution: Know exactly which documents were used to generate each answer
  • Cost-effective: Cheaper than fine-tuning and maintaining custom models
  • Privacy-friendly: Can run entirely on-premises with local models
  • Always current: Answers reflect your latest documents, not stale training data

RAG Architecture Overview

A RAG system consists of two main phases: indexing (preprocessing) and retrieval (query time).

Phase 1: Indexing (One-Time Setup)

  1. Load documents: Read PDFs, Word files, web pages, etc.
  2. Split into chunks: Break documents into smaller pieces (500-1000 characters)
  3. Generate embeddings: Convert text chunks into numerical vectors
  4. Store in vector database: Index embeddings for fast similarity search

Phase 2: Retrieval (Every Query)

  1. Embed the query: Convert user question to vector
  2. Search vector database: Find most similar document chunks
  3. Format prompt: Combine query + retrieved chunks
  4. Generate answer: LLM responds based on provided context
  5. Return with sources: Include references to source documents
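The two phases above can be sketched end to end in a few lines of dependency-free Python. This is a toy illustration, not production code: the bag-of-words embed() stands in for a real embedding model, and the sample chunks are invented for the example.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Stand-in embedding: a word-count vector (real systems use dense vectors)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Phase 1: indexing - embed each chunk once and store the vectors
chunks = [
    "Full-time employees accrue 15 days of paid vacation per year.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Phase 2: retrieval - embed the query and rank chunks by similarity
query = "How many vacation days do employees get?"
query_vec = embed(query)
best_chunk, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))
print(best_chunk)  # the vacation-policy chunk ranks highest
```

A real system swaps embed() for an embedding model and the list scan for a vector database, but the control flow is exactly this.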

Step-by-Step Implementation

Step 1: Environment Setup

First, install the required dependencies. We'll use ChromaDB as our vector store (free, open-source, runs locally).

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install core dependencies
pip install langchain langchain-openai langchain-community

# Install document loaders and text splitting
pip install pypdf unstructured

# Install vector store
pip install chromadb

# Install OpenAI client
pip install openai

# For local models (optional - requires Ollama installed)
pip install ollama

Set your OpenAI API key (or skip if using local models):

export OPENAI_API_KEY="your-api-key-here"

# Or create a .env file:
# OPENAI_API_KEY=your-api-key-here

# In Python, load with:
# from dotenv import load_dotenv
# load_dotenv()

Step 2: Load and Process Documents

Let's start by loading a PDF document. LangChain provides specialized loaders for different formats.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

print(f"Loaded {len(documents)} pages from PDF")
print(f"First page preview: {documents[0].page_content[:200]}...")

# Expected output:
# Loaded 45 pages from PDF
# First page preview: Employee Handbook
#
# Welcome to Acme Corporation
#
# This handbook contains important policies and procedures...

Documents are now loaded, but full pages are too large to use as context. We need to split them into smaller chunks.

# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Max characters per chunk
    chunk_overlap=200,    # Overlap between chunks to preserve context
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try to split on paragraphs first
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)

print(f"Split into {len(chunks)} chunks")
print(f"\nExample chunk:")
print(f"Content: {chunks[10].page_content}")
print(f"Metadata: {chunks[10].metadata}")

# Expected output:
# Split into 287 chunks
#
# Example chunk:
# Content: Vacation Policy
#
# Full-time employees accrue 15 days of paid vacation per year...
# Metadata: {'source': 'company_handbook.pdf', 'page': 12}

Step 3: Create Embeddings and Vector Store

Now we convert text chunks into numerical vectors (embeddings) and store them in ChromaDB for fast retrieval.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Initialize embeddings model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"  # Cost: $0.02 per 1M tokens
)

# Create vector store and index documents
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Save to disk for reuse
)

print(f"✅ Indexed {len(chunks)} chunks in vector store")

# For subsequent runs, load existing database:
# vectorstore = Chroma(
#     persist_directory="./chroma_db",
#     embedding_function=embeddings
# )

Using local models instead (no API costs, complete privacy):

# Requires Ollama installed: https://ollama.ai/
# Then run: ollama pull nomic-embed-text

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="nomic-embed-text"  # 768-dim embeddings, free, runs locally
)

# Rest of the code is identical - LangChain abstracts the implementation

Step 4: Create Retriever and Test Queries

The retriever searches the vector store for chunks most similar to the user's query.

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",  # Other options: "mmr" (diversity), "similarity_score_threshold"
    search_kwargs={"k": 4}     # Retrieve top 4 most relevant chunks
)

# Test retrieval
query = "What is the vacation policy?"
relevant_docs = retriever.invoke(query)

print(f"Query: {query}")
print(f"\nFound {len(relevant_docs)} relevant chunks:")
for i, doc in enumerate(relevant_docs):
    print(f"\n[{i+1}] Page {doc.metadata.get('page', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}...")

# Expected output:
# Query: What is the vacation policy?
#
# Found 4 relevant chunks:
#
# [1] Page 12
# Content: Vacation Policy
#
# Full-time employees accrue 15 days of paid vacation per year...

Step 5: Build the RAG Chain

Now combine the retriever with an LLM to create a complete RAG system. We'll use LangChain's modern LCEL syntax for clarity.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Initialize LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",  # Cost: $0.15 per 1M input tokens
    temperature=0         # Deterministic answers for factual queries
)

# Create prompt template
template = """You are a helpful assistant answering questions based on provided context.
Use the following pieces of context to answer the question at the end.
If you don't know the answer based on the context, say
"I don't have enough information to answer that" - don't make up information.

Context:
{context}

Question: {question}

Answer: Provide a clear, concise answer based only on the context above.
If you reference specific policies or procedures, mention where they come from."""

prompt = ChatPromptTemplate.from_template(template)

# Helper function to format retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Create RAG chain using LCEL
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Query the RAG system
question = "How many vacation days do employees get?"
answer = rag_chain.invoke(question)

print(f"Question: {question}")
print(f"\nAnswer: {answer}")

# Expected output:
# Question: How many vacation days do employees get?
#
# Answer: According to the Vacation Policy on page 12, full-time employees
# accrue 15 days of paid vacation per year. Part-time employees accrue vacation
# on a pro-rated basis depending on their scheduled hours.

Step 6: Add Source Attribution

For production systems, you want to return source documents alongside the answer so users can verify information.

from langchain.chains import RetrievalQA

# Alternative: use RetrievalQA for automatic source tracking
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = include all docs in one prompt
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": prompt,
    }
)

# Query with sources
result = qa_chain.invoke({"query": "What is the remote work policy?"})

print(f"Question: {result['query']}")
print(f"\nAnswer: {result['result']}")
print(f"\nSources:")
for i, doc in enumerate(result['source_documents']):
    print(f"  [{i+1}] Page {doc.metadata.get('page', 'N/A')} - {doc.metadata.get('source', 'Unknown')}")
    print(f"      Preview: {doc.page_content[:150]}...")

# Expected output:
# Question: What is the remote work policy?
#
# Answer: Employees may work remotely up to 2 days per week with manager
# approval. Remote work arrangements must be documented in writing and
# reviewed quarterly.
#
# Sources:
#   [1] Page 18 - company_handbook.pdf
#       Preview: Remote Work Policy
#
# To support work-life balance, Acme permits remote work under
# the following conditions...

Complete Working Example

Here's a complete, runnable RAG implementation you can use as a starting point:

# rag_system.py
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import os


class RAGSystem:
    def __init__(self, pdf_path: str, persist_dir: str = "./chroma_db"):
        """Initialize RAG system with a PDF document."""
        self.pdf_path = pdf_path
        self.persist_dir = persist_dir

        # Initialize components
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

        # Load or create vector store
        if os.path.exists(persist_dir):
            print("Loading existing vector store...")
            self.vectorstore = Chroma(
                persist_directory=persist_dir,
                embedding_function=self.embeddings
            )
        else:
            print("Creating new vector store...")
            self._index_documents()

        # Create retriever
        self.retriever = self.vectorstore.as_retriever(
            search_kwargs={"k": 4}
        )

        # Create RAG chain
        self._create_chain()

    def _index_documents(self):
        """Load, split, and index documents."""
        # Load PDF
        loader = PyPDFLoader(self.pdf_path)
        documents = loader.load()
        print(f"Loaded {len(documents)} pages")

        # Split into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )
        chunks = text_splitter.split_documents(documents)
        print(f"Split into {len(chunks)} chunks")

        # Create and persist vector store
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_dir
        )
        print(f"✅ Indexed {len(chunks)} chunks")

    def _create_chain(self):
        """Create the RAG chain."""
        template = """Answer the question based on the following context.
If you don't know, say "I don't have enough information" - don't make up information.

Context:
{context}

Question: {question}

Answer:"""
        prompt = ChatPromptTemplate.from_template(template)

        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        self.chain = (
            {"context": self.retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

    def query(self, question: str) -> dict:
        """Query the RAG system and return answer with sources."""
        # Get answer
        answer = self.chain.invoke(question)

        # Get source documents
        sources = self.retriever.invoke(question)

        return {
            "question": question,
            "answer": answer,
            "sources": [
                {
                    "page": doc.metadata.get("page", "N/A"),
                    "content": doc.page_content[:200] + "..."
                }
                for doc in sources
            ]
        }


# Usage example
if __name__ == "__main__":
    # Initialize RAG system
    rag = RAGSystem("company_handbook.pdf")

    # Ask questions
    questions = [
        "What is the vacation policy?",
        "How many sick days do employees get?",
        "What is the remote work policy?"
    ]

    for q in questions:
        result = rag.query(q)
        print(f"\nQ: {result['question']}")
        print(f"A: {result['answer']}")
        print(f"Sources: {len(result['sources'])} documents")
        print("-" * 80)

Best Practices for Production

1. Optimize Chunk Size and Overlap

The right chunk size depends on your use case. Test with different values:

  • Technical docs: 500-1000 characters, 100-200 overlap
  • Legal documents: 1500-2000 characters (preserve clause context)
  • Chat logs: Split by message or timestamp
  • Code files: Split by function or class definition
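To make the overlap mechanics concrete, here is a dependency-free sliding-window splitter. It is a simplification of what RecursiveCharacterTextSplitter does (no separator-aware splitting), useful for reasoning about how chunk_size and overlap interact:

```python
def split_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    # Slide a window of chunk_size characters, stepping forward by
    # (chunk_size - overlap) so consecutive chunks share `overlap` characters
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

document = "The quick brown fox jumps over the lazy dog. " * 60  # ~2700 chars
chunks = split_text(document, chunk_size=1000, overlap=200)
print(f"{len(chunks)} chunks; each shares 200 characters with its neighbor")
```

The overlap means a sentence cut at a chunk boundary still appears intact in the adjacent chunk, which is exactly why the tutorial's splitter uses chunk_overlap=200.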

2. Implement Caching for Embeddings

Avoid re-computing embeddings for the same queries:

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

# Create cache
store = LocalFileStore("./embedding_cache/")

# Wrap embeddings with cache
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=OpenAIEmbeddings(),
    document_embedding_cache=store,
    namespace="openai_embeddings"
)

# Use cached_embeddings in vectorstore
# Reduces costs by 60% for repeated queries

3. Add Metadata Filtering

Filter retrieval by document type, date, author, etc.:

# When loading documents, add metadata
for doc in documents:
    doc.metadata["department"] = "HR"
    doc.metadata["last_updated"] = "2026-01-15"

# Filter retrieval
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"department": "HR"}  # Only retrieve HR documents
    }
)

4. Monitor and Log Performance

import time
from datetime import datetime

def query_with_metrics(rag_system, question):
    start = time.time()
    result = rag_system.query(question)
    elapsed = time.time() - start

    # Log metrics
    print(f"[{datetime.now()}] Query: {question[:50]}...")
    print(f"  Response time: {elapsed:.2f}s")
    print(f"  Sources retrieved: {len(result['sources'])}")
    print(f"  Answer length: {len(result['answer'])} chars")

    return result

Using Local Models for Complete Privacy

For sensitive documents, run everything locally with Ollama:

# Install Ollama: https://ollama.ai/
# Pull models:
#   ollama pull llama3.3:70b
#   ollama pull nomic-embed-text

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

# Use local models
embeddings = OllamaEmbeddings(model="nomic-embed-text")
llm = Ollama(model="llama3.3:70b", temperature=0)

# Everything else stays the same
# Vector store, retriever, chain logic is identical
# Now runs 100% locally with zero API costs

Common Issues and Solutions

Problem | Cause | Solution
Answers lack context | Chunks too small or k too low | Increase chunk_size to 1500 or k to 6-8
Irrelevant info in answers | Chunks too large, poor retrieval | Decrease chunk_size to 500, add metadata filters
Slow query responses | No embedding cache, large k | Implement CacheBackedEmbeddings, reduce k to 3-4
High API costs | Using gpt-4, no caching | Switch to gpt-4o-mini, cache embeddings, consider local models
"I don't know" for known info | Poor chunking or embeddings | Adjust separators, try different embedding model, check document quality

Next Steps

Now that you have a working RAG system, consider these enhancements:

  • Advanced retrieval: Implement hybrid search (keyword + semantic), reranking, or parent-document retrieval
  • Multiple data sources: Combine PDFs, web pages, databases, APIs into one knowledge base
  • Conversation memory: Add ConversationBufferMemory for multi-turn Q&A sessions
  • Web interface: Build a Streamlit or FastAPI frontend for your RAG system
  • Production deployment: Deploy to AWS Lambda or Docker containers with horizontal scaling
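On the conversation-memory point: the core trick is folding recent turns into the retrieval query, so a follow-up like "does it apply to part-time employees?" still finds the vacation-policy chunks. A minimal sketch, using a hypothetical contextualize() helper (not a LangChain API - LangChain's memory classes handle this more completely):

```python
# Hypothetical helper: prepend recent Q/A turns to the retrieval query so
# follow-up questions remain grounded in the conversation topic.
def contextualize(history: list[tuple[str, str]], question: str, max_turns: int = 2) -> str:
    recent = history[-max_turns:]
    context = " ".join(f"Q: {q} A: {a}" for q, a in recent)
    return f"{context} Follow-up: {question}" if recent else question

history = [("What is the vacation policy?", "Full-time employees accrue 15 days per year.")]
print(contextualize(history, "Does it apply to part-time employees?"))
```

Passing this contextualized string to the retriever (while sending only the bare question to the LLM prompt) is one common design; rewriting the follow-up into a standalone question with an extra LLM call is another.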

For comprehensive professional training on RAG systems and LLM applications:

  • RAG and Agents in Production (3-day intensive): Advanced RAG techniques, LangChain, LlamaIndex, vector database optimization, deployment patterns
  • Claude API for Developers (2 days): Master Claude 4.5 for RAG applications, extended context windows (200K tokens), prompt optimization

Frequently Asked Questions

Why use RAG instead of just prompting an LLM directly?

LLMs are trained on data with a cutoff date and don't know about your private documents. RAG retrieves relevant context from your documents and includes it in the prompt, enabling the LLM to answer questions about information it was never trained on. This eliminates the need for expensive fine-tuning and provides source attribution for answers.

Can I use RAG with local/open-source models instead of OpenAI?

Absolutely. LangChain supports Ollama for local inference (use ChatOllama and OllamaEmbeddings). You can run Llama 3.3 70B or Mistral locally with zero API costs. For embeddings, nomic-embed-text via Ollama works excellently. This gives you complete data privacy and eliminates ongoing API costs.

What's the best chunk size for document splitting in RAG?

There's no universal answer - it depends on your use case. For technical documentation: 500-1000 characters with 100-200 overlap. For legal documents: 1500-2000 characters (preserve clause context). For chat transcripts: split by turn or timestamp. The key is testing retrieval quality - if answers lack context, increase chunk size; if irrelevant content appears, decrease it.

How do I handle multiple document formats (PDF, Word, HTML)?

LangChain provides specialized loaders: PyPDFLoader for PDFs, UnstructuredWordDocumentLoader for Word, WebBaseLoader for HTML. For production, use UnstructuredFileLoader which auto-detects format. All loaders return the same Document format, so your RAG pipeline stays consistent regardless of source format.
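In practice you can dispatch on file extension before falling back to auto-detection. A sketch, where the dispatch table and loader_for() helper are illustrative (the class names are the real LangChain loaders mentioned above, represented as strings here):

```python
from pathlib import Path

# Hypothetical dispatch table mapping extensions to LangChain loader names
LOADERS = {
    ".pdf": "PyPDFLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".html": "WebBaseLoader",
}

def loader_for(path: str) -> str:
    # Fall back to the format-auto-detecting loader for anything unrecognized
    suffix = Path(path).suffix.lower()
    return LOADERS.get(suffix, "UnstructuredFileLoader")

print(loader_for("handbook.pdf"))  # → PyPDFLoader
```

Because every loader returns the same Document objects, the splitting and indexing code downstream never needs to know which loader produced them.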

What's the typical response time and cost for a RAG query?

With OpenAI gpt-4o-mini and text-embedding-3-small: ~1-2 seconds per query, costing roughly $0.002 per query (about $0.0005 for embeddings and $0.0015 for generation with 4 retrieved chunks). With local Llama via Ollama: 2-5 seconds (depending on GPU), $0 per query. For production, add caching (Redis) to reduce embedding costs by 60% for repeated queries.

Master RAG and LLM Applications

Professional training programs for developers building AI-powered applications.

View Training Programs · Contact Us