Architecture Overview
A production RAG system has two distinct phases that run at different times:
Indexing Pipeline (runs once, or on document updates)
Raw Documents → Load → Clean → Chunk → Embed → Store in Vector DB
Query Pipeline (runs on every user request)
User Query → Embed → Retrieve Top-K Chunks → Build Prompt → LLM → Answer
The key insight is that embedding quality and chunk design are fixed at indexing time. A poorly chunked document cannot be rescued at query time — which is why this tutorial spends significant time on those early steps.
Step 1: Environment Setup
We'll use Python 3.11+. Install dependencies in a virtual environment:
# Create isolated environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Core RAG stack
pip install langchain langchain-openai langchain-community langchain-chroma
# Document processing
pip install pypdf unstructured[pdf] python-docx
# Vector databases
pip install chromadb # Local, open-source
pip install pinecone # Cloud-managed (optional)
# Evaluation framework
pip install ragas datasets
# Deployment dependencies
pip install fastapi uvicorn mangum # mangum = Lambda ASGI adapter
# Utilities
pip install python-dotenv tiktoken
Create a .env file for your credentials:
# .env
OPENAI_API_KEY=sk-... # Or use Ollama for free local inference
PINECONE_API_KEY=... # Only needed if using Pinecone
# For Ollama (free, local models):
# Install from https://ollama.ai, then:
# ollama pull nomic-embed-text (768-dim embeddings)
# ollama pull llama3.2 (3B model, fast)
Step 2: Vector Database Setup
Option A: ChromaDB (Local / Docker)
ChromaDB is the best starting point — free, runs in-process or as a server, and needs zero cloud configuration. For local development, use embedded mode:
# chroma_setup.py
import chromadb
from chromadb.config import Settings
# Embedded mode (single process, persisted to disk)
client = chromadb.PersistentClient(
path="./chroma_data",
settings=Settings(anonymized_telemetry=False)
)
# Create (or get existing) collection
collection = client.get_or_create_collection(
name="documents",
metadata={"hnsw:space": "cosine"} # cosine similarity for semantic search
)
print(f"Collection ready: {collection.name}")
print(f"Documents indexed: {collection.count()}")
For a persistent server (shared across processes or Docker containers):
# Run ChromaDB as a standalone server:
# docker run -p 8000:8000 chromadb/chroma
# Connect from Python:
import chromadb
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("documents")
Option B: Pinecone (Cloud, Production Scale)
Pinecone excels when you have millions of documents or need managed replication. Create a free account at pinecone.io, then:
# pinecone_setup.py
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-api-key")
# Create index (1536 dims = OpenAI text-embedding-3-small)
# For nomic-embed-text: dimension=768
if "rag-docs" not in pc.list_indexes().names():
pc.create_index(
name="rag-docs",
dimension=1536,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("rag-docs")
print(index.describe_index_stats())
# Output: {'dimension': 1536, 'total_vector_count': 0, ...}
Cost note: Pinecone Serverless charges $0.096 per 1M reads and $2/GB/month storage. A 10,000-document knowledge base costs roughly $2-5/month. For <1M documents, ChromaDB on a $5/month VPS is cheaper.
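To make the cost note concrete, here is a back-of-envelope model using the prices quoted above. The traffic and storage numbers in the usage example are hypothetical, and pricing changes, so treat this as a sketch rather than a budget:

```python
# Back-of-envelope Pinecone Serverless cost model using the prices quoted
# above ($0.096 per 1M reads, $2/GB/month storage). Illustrative only;
# check current pricing before budgeting.

def pinecone_monthly_cost(reads_per_month: int, storage_gb: float) -> float:
    """Estimate monthly Pinecone Serverless cost in USD."""
    read_cost = reads_per_month / 1_000_000 * 0.096
    storage_cost = storage_gb * 2.00
    return round(read_cost + storage_cost, 2)

# Hypothetical workload: 100K queries/month against ~0.5 GB of vectors
print(pinecone_monthly_cost(100_000, 0.5))  # → 1.01
```

At this scale, storage dominates reads, which is why the flat-fee VPS option stays competitive until traffic grows.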
Step 3: Document Loading and Chunking Strategy
Chunking is the most consequential design decision in a RAG system. Chunks that are too small lose context; chunks too large dilute retrieval precision.
Loading Multiple Document Formats
# document_loader.py
from langchain_community.document_loaders import (
PyPDFLoader,
UnstructuredWordDocumentLoader,
WebBaseLoader,
DirectoryLoader,
)
from pathlib import Path
def load_documents(source_dir: str = "./docs") -> list:
"""Load all documents from a directory, auto-detecting format."""
loaders = {
"**/*.pdf": PyPDFLoader,
"**/*.docx": UnstructuredWordDocumentLoader,
}
all_docs = []
for pattern, loader_cls in loaders.items():
loader = DirectoryLoader(
source_dir,
glob=pattern,
loader_cls=loader_cls,
show_progress=True,
)
docs = loader.load()
all_docs.extend(docs)
print(f"Loaded {len(docs)} pages from {pattern} files")
# Add metadata for filtering later
for doc in all_docs:
doc.metadata["ingested_at"] = "2026-04-09"
print(f"\nTotal: {len(all_docs)} document pages loaded")
return all_docs
docs = load_documents("./docs")
# Output:
# Loaded 45 pages from **/*.pdf files
# Loaded 12 pages from **/*.docx files
# Total: 57 document pages loaded
Strategy 1: Recursive Character Splitting (Baseline)
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Good default for most document types
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Target chunk size in characters
chunk_overlap=200, # Overlap preserves context across boundaries
length_function=len,
separators=[
"\n\n", # Prefer splitting on paragraph breaks
"\n", # Then line breaks
". ", # Then sentence boundaries
" ", # Then words
"", # Character fallback
],
)
chunks = splitter.split_documents(docs)
print(f"Split into {len(chunks)} chunks")
print(f"Avg chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
# Output:
# Split into 341 chunks
# Avg chunk size: 847 chars
Strategy 2: Semantic Chunking (Better Recall)
Semantic chunking splits on topic boundaries detected by embedding similarity, rather than fixed character counts. In some benchmarks it improves context recall by 15-25% because related content stays together.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
embeddings,
    breakpoint_threshold_type="percentile",  # Split where embedding distance between segments
    breakpoint_threshold_amount=95,          # exceeds the 95th percentile of all distances
)
semantic_chunks = semantic_splitter.split_documents(docs)
print(f"Semantic chunks: {len(semantic_chunks)}")
print(f"Avg chunk size: {sum(len(c.page_content) for c in semantic_chunks) // len(semantic_chunks)} chars")
# Output (chunks are variable size, topic-aligned):
# Semantic chunks: 198
# Avg chunk size: 1423 chars
# Trade-off: ~2x more tokens to embed, but much better retrieval quality
# Cost: ~$0.005 per 1M chars with text-embedding-3-small ($0.02 per 1M tokens)
Chunk Size Comparison
| Document Type | Recommended Chunk Size | Overlap | Splitter |
|---|---|---|---|
| Technical docs / APIs | 500-800 chars | 100-150 | Recursive |
| Legal / contracts | 1500-2000 chars | 300-400 | Recursive (sentence) |
| Research papers | Topic-based | N/A | Semantic |
| Customer support FAQs | One Q&A per chunk | 0 | Custom (split on Q:) |
| Code files | Function / class | 0-50 | RecursiveCharacter (code) |
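The "Custom (split on Q:)" row in the table can be sketched in a few lines, assuming FAQ files format entries as "Q: ..." / "A: ..." blocks. The sample text here is invented for illustration:

```python
# Minimal custom splitter for FAQ documents: one Q&A pair per chunk.
# Assumes the source formats entries as "Q: ..." / "A: ..." blocks.
import re

def split_faq(text: str) -> list[str]:
    """Split FAQ text into one chunk per question/answer pair."""
    # Split at every line that starts with "Q:", keeping the delimiter
    parts = re.split(r"(?m)^(?=Q:)", text)
    return [p.strip() for p in parts if p.strip()]

faq = """Q: What is the refund window?
A: 30 days from purchase.
Q: Do you ship to Europe?
A: Yes, 7-14 business days."""

chunks = split_faq(faq)
print(len(chunks))  # → 2
```

Each chunk keeps its question and answer together, so a query matching the question text retrieves the answer with it.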
Step 4: Embedding and Vector Indexing
# indexer.py
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
import os
# Initialize embedding model
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small", # 1536-dim, $0.02 per 1M tokens
# Alternative: text-embedding-3-large (3072-dim, higher accuracy, 5x cost)
)
# Build (or load) vector store
PERSIST_DIR = "./chroma_data"
if os.path.exists(PERSIST_DIR) and os.listdir(PERSIST_DIR):
print("Loading existing vector store...")
vectorstore = Chroma(
persist_directory=PERSIST_DIR,
embedding_function=embeddings,
collection_name="documents",
)
else:
print(f"Indexing {len(chunks)} chunks...")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory=PERSIST_DIR,
collection_name="documents",
collection_metadata={"hnsw:space": "cosine"},
)
print("Indexing complete.")
print(f"Vector store ready: {vectorstore._collection.count()} vectors")
# Output: Vector store ready: 341 vectors
Using free local embeddings with Ollama:
# First pull the model:
# ollama pull nomic-embed-text
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text") # 768-dim, free, local
# Rest of the indexing code is identical
Step 5: Retrieval Chain and Generation
# rag_chain.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Retriever: MMR (Maximal Marginal Relevance) reduces duplicate chunks
retriever = vectorstore.as_retriever(
search_type="mmr", # Balances relevance + diversity
search_kwargs={
"k": 5, # Return 5 chunks
"fetch_k": 20, # Consider 20 candidates, pick diverse 5
"lambda_mult": 0.7, # 0=max diversity, 1=max relevance
},
)
SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using ONLY the context below.
If the context does not contain enough information, say "I don't have enough information to answer this."
Do not make up information or draw from outside knowledge.
Context:
{context}"""
prompt = ChatPromptTemplate.from_messages([
("system", SYSTEM_PROMPT),
("human", "{question}"),
])
def format_docs(docs: list) -> str:
"""Format retrieved documents with source attribution."""
parts = []
for i, doc in enumerate(docs, 1):
source = doc.metadata.get("source", "unknown")
page = doc.metadata.get("page", "")
label = f"[{i}] {source}" + (f" p.{page}" if page else "")
parts.append(f"{label}\n{doc.page_content}")
return "\n\n---\n\n".join(parts)
# Chain with source tracking
rag_chain_with_sources = RunnableParallel(
answer=(
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
),
sources=(retriever),
)
# Query
result = rag_chain_with_sources.invoke("What is the refund policy?")
print(f"Answer:\n{result['answer']}\n")
print("Sources:")
for doc in result["sources"]:
print(f" - {doc.metadata.get('source')} (p.{doc.metadata.get('page', '?')})")
# Expected output:
# Answer:
# According to the refund policy section, customers may request a full refund
# within 30 days of purchase if the product is unused and in original condition.
#
# Sources:
# - terms_and_conditions.pdf (p.4)
# - faq.pdf (p.12)
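The MMR retriever configured above is easier to reason about with a toy version: at each step, pick the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. This is a sketch of the idea, not LangChain's implementation, and the similarity numbers are invented:

```python
# Plain-Python sketch of Maximal Marginal Relevance: greedily pick the
# candidate that balances similarity to the query against similarity to
# chunks already selected. Toy scores stand in for real embeddings.

def mmr_select(query_sim, pairwise_sim, k=2, lambda_mult=0.7):
    """query_sim[i]: sim(query, doc i); pairwise_sim[i][j]: sim(doc i, doc j)."""
    selected: list[int] = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy = similarity to the closest already-selected doc
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Doc 1 is nearly a duplicate of doc 0; doc 2 is distinct
query_sim = [0.90, 0.88, 0.75]
pairwise = [[1.0, 0.95, 0.2], [0.95, 1.0, 0.2], [0.2, 0.2, 1.0]]
print(mmr_select(query_sim, pairwise, k=2))  # → [0, 2]
```

With `lambda_mult=1.0` the same call degenerates to pure relevance ranking and returns the near-duplicate pair instead; that is the dial `lambda_mult` controls.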
Step 6: Retrieval Quality Testing
Before evaluating with RAGAS, manually test your retriever to catch obvious configuration problems. Ten minutes here catches most of the issues that would otherwise surface as puzzling RAGAS scores.
# retrieval_test.py
from typing import NamedTuple
class RetrievalTest(NamedTuple):
query: str
expected_keywords: list[str] # Words that must appear in retrieved chunks
    expected_k: int = 3  # Minimum number of chunks that must be retrieved
RETRIEVAL_TESTS = [
RetrievalTest(
query="What is the refund policy?",
expected_keywords=["refund", "return", "days"],
expected_k=2,
),
RetrievalTest(
query="How do I reset my password?",
expected_keywords=["password", "reset", "email"],
expected_k=1,
),
RetrievalTest(
query="What payment methods are accepted?",
expected_keywords=["payment", "credit card", "paypal"],
expected_k=2,
),
]
def run_retrieval_tests(retriever, tests: list[RetrievalTest]) -> dict:
"""Run retrieval tests and report pass/fail."""
results = {"passed": 0, "failed": 0, "details": []}
for test in tests:
docs = retriever.invoke(test.query)
combined_text = " ".join(d.page_content.lower() for d in docs)
# Check all expected keywords appear in retrieved content
keywords_found = {kw: kw.lower() in combined_text for kw in test.expected_keywords}
all_found = all(keywords_found.values())
has_enough_docs = len(docs) >= test.expected_k
passed = all_found and has_enough_docs
results["passed" if passed else "failed"] += 1
results["details"].append({
"query": test.query,
"passed": passed,
"chunks_retrieved": len(docs),
"keywords_found": keywords_found,
})
return results
report = run_retrieval_tests(retriever, RETRIEVAL_TESTS)
for detail in report["details"]:
status = "PASS" if detail["passed"] else "FAIL"
print(f"[{status}] {detail['query']}")
if not detail["passed"]:
missing = [k for k, v in detail["keywords_found"].items() if not v]
print(f" Missing keywords: {missing}")
print(f" Chunks retrieved: {detail['chunks_retrieved']}")
print(f"\nResults: {report['passed']}/{len(RETRIEVAL_TESTS)} tests passed")
# Output:
# [PASS] What is the refund policy?
# [PASS] How do I reset my password?
# [FAIL] What payment methods are accepted?
# Missing keywords: ['paypal']
# Chunks retrieved: 5
# Results: 2/3 tests passed
If a test fails, debug in this order: (1) verify the keyword exists in your documents, (2) increase k, (3) try a different query phrasing, (4) check chunk boundaries aren't splitting the keyword away from context.
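For debug step (1), a small helper can confirm the keyword exists in the indexed chunks at all before you blame the retriever. The `Chunk` class and mini-corpus below are stand-ins for the real loaded documents:

```python
# Quick helper for debug step (1): does the keyword exist anywhere in the
# indexed chunks? If not, no retriever setting will surface it.

def find_keyword(chunks, keyword: str) -> list[int]:
    """Return indices of chunks whose text contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return [i for i, c in enumerate(chunks) if kw in c.page_content.lower()]

# Hypothetical mini-corpus standing in for the real document chunks
class Chunk:
    def __init__(self, text: str):
        self.page_content = text

corpus = [
    Chunk("We accept credit cards and PayPal."),
    Chunk("Refunds are available within 30 days."),
]
print(find_keyword(corpus, "paypal"))   # → [0]
print(find_keyword(corpus, "bitcoin"))  # → []
```

An empty result means the failure is an indexing gap, not a retrieval tuning problem, and steps (2)-(4) can be skipped.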
Step 7: Evaluation with RAGAS
RAGAS (Retrieval Augmented Generation Assessment) measures four dimensions that matter in production. Unlike manual testing, RAGAS uses an LLM-as-judge approach to score at scale.
| Metric | What it measures | Production target |
|---|---|---|
| Faithfulness | Answer is grounded in retrieved context (no hallucination) | > 0.85 |
| Answer Relevancy | Answer actually addresses the question asked | > 0.80 |
| Context Precision | Retrieved chunks are relevant (no noisy chunks) | > 0.75 |
| Context Recall | All relevant information was retrieved (none missed) | > 0.70 |
Building a Test Dataset
# evaluation_dataset.py
from datasets import Dataset
# Build Q&A pairs from your documents
# Ground truth answers come from the source documents
evaluation_data = {
"question": [
"What is the refund policy for digital products?",
"How long does shipping take to Europe?",
"Can I use the product commercially?",
"What languages is customer support available in?",
"Is there a free trial period?",
],
"ground_truth": [
"Digital products are non-refundable except in cases of technical issues verified by our support team.",
"Standard shipping to Europe takes 7-14 business days. Express shipping takes 3-5 business days.",
"Yes, commercial use is permitted under the Professional and Enterprise license tiers.",
"Customer support is available in English, French, Spanish, and German.",
"Yes, all plans include a 14-day free trial with full feature access and no credit card required.",
],
    # "contexts" and "answer" are added later via add_column;
    # Dataset.from_dict requires equal-length columns, so they
    # cannot start here as empty lists
}
eval_dataset = Dataset.from_dict(evaluation_data)
print(f"Evaluation dataset: {len(eval_dataset)} questions")
Running RAGAS Evaluation
# run_evaluation.py
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Generate answers and collect contexts for each question
def prepare_eval_dataset(dataset, retriever, rag_chain):
"""Run RAG over test questions to populate contexts and answers."""
contexts_list = []
answers_list = []
for question in dataset["question"]:
# Retrieve context
docs = retriever.invoke(question)
contexts = [doc.page_content for doc in docs]
contexts_list.append(contexts)
# Generate answer
answer = rag_chain.invoke(question)
answers_list.append(answer)
dataset = dataset.add_column("contexts", contexts_list)
dataset = dataset.add_column("answer", answers_list)
return dataset
# Prepare dataset with generated answers; .pick("answer") turns the parallel
# chain's dict output into just the answer string
eval_ready = prepare_eval_dataset(eval_dataset, retriever, rag_chain_with_sources.pick("answer"))
# Run RAGAS evaluation
ragas_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
ragas_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
results = evaluate(
dataset=eval_ready,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
llm=ragas_llm,
embeddings=ragas_embeddings,
)
print("\n=== RAGAS Evaluation Results ===")
print(f"Faithfulness: {results['faithfulness']:.3f} (target: >0.85)")
print(f"Answer Relevancy: {results['answer_relevancy']:.3f} (target: >0.80)")
print(f"Context Precision: {results['context_precision']:.3f} (target: >0.75)")
print(f"Context Recall: {results['context_recall']:.3f} (target: >0.70)")
# Typical output for a well-tuned system:
# === RAGAS Evaluation Results ===
# Faithfulness: 0.912 (target: >0.85)
# Answer Relevancy: 0.847 (target: >0.80)
# Context Precision: 0.783 (target: >0.75)
# Context Recall: 0.741 (target: >0.70)
Diagnosing and Improving Low Scores
# diagnosis.py
def diagnose_ragas_failures(results, threshold=0.75):
"""Print actionable remediation for each failing metric."""
metrics = {
"faithfulness": {
"score": results["faithfulness"],
"fixes": [
"Tighten the system prompt: 'Answer ONLY using the context. Never add information.'",
"Reduce temperature to 0 for deterministic, grounded answers",
"Add a post-processing step to verify every claim appears in context",
],
},
"answer_relevancy": {
"score": results["answer_relevancy"],
"fixes": [
"Improve query rewriting — add a step to rephrase ambiguous questions",
"Adjust prompt to require answering the specific question asked",
"Check if low-relevancy answers are caused by off-topic retrieved chunks",
],
},
"context_precision": {
"score": results["context_precision"],
"fixes": [
"Reduce k (retrieved chunks) — fewer but better chunks improve precision",
"Add metadata filters to narrow search scope",
"Try MMR search_type to reduce duplicate/noisy chunks",
"Switch to hybrid search (BM25 + semantic) for keyword-heavy queries",
],
},
"context_recall": {
"score": results["context_recall"],
"fixes": [
"Increase k to retrieve more candidate chunks",
"Improve chunking — large chunks may split relevant content",
"Use semantic chunking to preserve topic boundaries",
"Add query expansion (generate multiple phrasings of the query)",
],
},
}
print("\n=== Diagnosis Report ===")
for metric, data in metrics.items():
if data["score"] < threshold:
print(f"\nFAIL: {metric} = {data['score']:.3f}")
print("Recommended fixes:")
for fix in data["fixes"]:
print(f" • {fix}")
diagnose_ragas_failures(results)
Step 8: Deployment
Option A: Docker Compose (Local / VPS)
Package the RAG API as a FastAPI service alongside ChromaDB. This runs identically in development, on a VPS, or in a container orchestrator.
# app/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import os
app = FastAPI(title="RAG API", version="1.0.0")
class QueryRequest(BaseModel):
question: str
k: int = 5
class QueryResponse(BaseModel):
answer: str
sources: list[dict]
latency_ms: float
# Initialize components on startup
@app.on_event("startup")
async def startup():
global retriever, chain
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    # langchain_chroma's Chroma takes a chromadb client, not host/port kwargs
    import chromadb
    chroma_client = chromadb.HttpClient(
        host=os.getenv("CHROMA_HOST", "chroma"),  # Docker service name
        port=int(os.getenv("CHROMA_PORT", "8000")),
    )
    vectorstore = Chroma(
        client=chroma_client,
        collection_name="documents",
        embedding_function=embeddings,
    )
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "Answer using ONLY the context below.\n\nContext:\n{context}"),
("human", "{question}"),
])
chain = (
{"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
"question": RunnablePassthrough()}
| prompt | llm | StrOutputParser()
)
@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
import time
start = time.time()
try:
docs = retriever.invoke(request.question)
answer = chain.invoke(request.question)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
return QueryResponse(
answer=answer,
sources=[
{"source": d.metadata.get("source", ""), "page": d.metadata.get("page")}
for d in docs
],
latency_ms=(time.time() - start) * 1000,
)
@app.get("/health")
async def health():
return {"status": "ok"}
# docker-compose.yml
version: "3.9"
services:
chroma:
image: chromadb/chroma:latest
ports:
- "8000:8000"
volumes:
- chroma_data:/chroma/chroma
    # To enable token auth, set both CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER
    # and CHROMA_SERVER_AUTH_CREDENTIALS (omitted here for brevity)
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/heartbeat"]
interval: 10s
timeout: 5s
retries: 3
rag_api:
build: .
ports:
- "8080:8080"
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- CHROMA_HOST=chroma
- CHROMA_PORT=8000
depends_on:
chroma:
condition: service_healthy
command: uvicorn app.main:app --host 0.0.0.0 --port 8080
volumes:
chroma_data:
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ ./app/
# Start API
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
# Build and run
docker-compose up --build
# Test
curl -X POST http://localhost:8080/query \
-H "Content-Type: application/json" \
-d '{"question": "What is the refund policy?"}'
# Response:
# {
# "answer": "Refunds are available within 30 days of purchase...",
# "sources": [{"source": "terms.pdf", "page": 4}],
# "latency_ms": 1247.3
# }
Option B: AWS Lambda (Serverless)
For serverless deployment, use Mangum to adapt FastAPI to Lambda's event format, and Pinecone (or ChromaDB on EFS) as the vector store. Lambda removes server management at the cost of cold starts.
# lambda_handler.py
from mangum import Mangum
from app.main import app # FastAPI app from above
# Mangum wraps FastAPI for Lambda + API Gateway
handler = Mangum(app, lifespan="off")
# Deployment steps:
# 1. Package dependencies into a Lambda layer or container image
# 2. Set Lambda environment variables: OPENAI_API_KEY, PINECONE_API_KEY
# 3. Connect to Pinecone instead of ChromaDB (ChromaDB on Lambda is complex)
# 4. Set memory to 1024MB minimum (vector operations need RAM)
# 5. Set timeout to 30 seconds (retrieval + LLM generation can be slow)
# serverless.yml (Serverless Framework)
service: rag-api
provider:
name: aws
runtime: python3.11
region: eu-west-1
memorySize: 1024 # MB — vector operations need RAM
timeout: 30 # Seconds — allow for cold start + LLM generation
environment:
OPENAI_API_KEY: ${env:OPENAI_API_KEY}
PINECONE_API_KEY: ${env:PINECONE_API_KEY}
VECTOR_STORE: pinecone # Use Pinecone for serverless (no persistent FS)
functions:
api:
handler: lambda_handler.handler
events:
- httpApi:
path: /{proxy+}
method: ANY
layers:
- ${cf:rag-dependencies-layer.LambdaLayerArn}
# Deploy:
# npm install -g serverless
# serverless deploy --stage prod
Lambda cost estimate: 1,000 daily RAG queries at ~3s average billed duration × 1024MB = ~$2/month for compute (the 30s timeout is a ceiling, not the typical duration). Pinecone adds ~$2/month for a small index. Total: ~$4/month for a production-ready serverless RAG API serving 30K queries/month.
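That compute figure can be sanity-checked against AWS's published on-demand rate of roughly $0.0000166667 per GB-second, assuming ~3 seconds of average billed duration per query (not the 30-second timeout ceiling). Verify current pricing before relying on this:

```python
# Rough Lambda compute cost check. Rate is AWS's on-demand price of
# ~$0.0000166667 per GB-second at time of writing; confirm before use.
GB_SECOND_PRICE = 0.0000166667

def lambda_monthly_cost(queries_per_day: int, avg_seconds: float, memory_mb: int) -> float:
    """Monthly Lambda compute cost in USD (30-day month)."""
    gb_seconds = queries_per_day * 30 * avg_seconds * (memory_mb / 1024)
    return round(gb_seconds * GB_SECOND_PRICE, 2)

print(lambda_monthly_cost(1_000, 3, 1024))  # → 1.5
```

Request charges ($0.20 per 1M invocations) add only pennies at this volume, so GB-seconds dominate the bill.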
Performance Optimization Checklist
- Cache embeddings: Use `CacheBackedEmbeddings` with Redis to avoid re-embedding identical queries — saves 60-80% on embedding API costs for production traffic
- Async retrieval: Use `retriever.ainvoke()` and `llm.ainvoke()` for non-blocking I/O in FastAPI — supports 3-5x more concurrent requests on the same hardware
- Batch indexing: When indexing >10K documents, use `vectorstore.add_documents()` in batches of 100 to avoid rate limits
- Reduce k first: Lower k (retrieved chunks) before any other optimization — going from k=10 to k=4 halves prompt tokens and typically improves precision
- Use smaller generation models: gpt-4o-mini costs 30x less than gpt-4o with 85-90% of the answer quality for factual retrieval tasks
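To show the idea behind the embedding-cache item without pulling in Redis, here is a stdlib sketch that memoizes vectors by text hash. `embed_fn` is a stand-in for a real embeddings client, not a LangChain API:

```python
# The caching idea behind CacheBackedEmbeddings, with a plain dict in
# place of Redis: hash the text, reuse the stored vector on repeats.
import hashlib

class CachedEmbedder:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache: dict[str, list[float]] = {}
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.cache:
            self.misses += 1  # only a cache miss pays for an API call
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

fake_embed = lambda text: [float(len(text))]  # stand-in embedding model
embedder = CachedEmbedder(fake_embed)
embedder.embed("refund policy")
embedder.embed("refund policy")  # served from cache, no second "API call"
print(embedder.misses)  # → 1
```

In production the dict becomes a shared store (Redis) so the cache survives restarts and is shared across workers, which is what `CacheBackedEmbeddings` provides.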
Next Steps
- Hybrid search: Combine BM25 keyword search with semantic search using `EnsembleRetriever` — improves precision on exact-match queries by 20-30%
- Reranking: Add a Cohere or cross-encoder reranker after retrieval to re-score chunks — consistently improves answer quality at ~$0.001 per query extra cost
- Multi-modal RAG: Extend to images and tables using GPT-4o vision or Unstructured's table extraction
- Agentic RAG: Use LangGraph to build a retrieval agent that decides when to search, what to search for, and when it has enough context
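As background for the hybrid-search item: `EnsembleRetriever` merges the ranked lists from its sub-retrievers with weighted Reciprocal Rank Fusion. A stdlib sketch of unweighted RRF, with invented document IDs:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so items ranked well by multiple retrievers float to the top.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists; k dampens the weight of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_refunds", "doc_shipping", "doc_faq"]      # keyword results
vector_hits = ["doc_faq", "doc_refunds", "doc_pricing"]     # semantic results
print(rrf_merge([bm25_hits, vector_hits]))
```

`doc_refunds` and `doc_faq` appear in both lists, so they outrank documents that only one retriever found; k=60 is the conventional default from the original RRF paper.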
Frequently Asked Questions
What is the difference between ChromaDB and Pinecone for RAG?
ChromaDB is a free, open-source vector database that runs locally (or in Docker). It's ideal for development, small-to-medium datasets (<10M vectors), and privacy-sensitive deployments. Pinecone is a managed cloud service with automatic scaling, serverless billing (~$0.096 per 1M reads), and built-in replication — best for production systems with millions of documents or teams without infrastructure expertise. You can build with ChromaDB and migrate to Pinecone later without changing your LangChain retriever code.
What RAGAS scores should I target before going to production?
Industry benchmarks for production RAG systems: Faithfulness > 0.85 (LLM's answer is grounded in retrieved context), Answer Relevancy > 0.80 (response addresses the question), Context Precision > 0.75 (retrieved chunks are relevant), Context Recall > 0.70 (enough relevant context is retrieved). If any score is below threshold, diagnose the specific failure: low context recall → increase k or improve embeddings; low faithfulness → improve the system prompt to reduce hallucination; low answer relevancy → refine query rewriting.
How do I choose chunk size? Is there a formula?
No universal formula, but a practical approach: decide how many tokens each retrieved chunk should contribute to the prompt, then convert that budget to characters (roughly 4 characters per token for English). For most retrieval tasks start with 1000 characters / 200 overlap and benchmark. If your queries are short (< 5 words), smaller chunks (500 chars) retrieve more precisely. If queries are complex multi-sentence questions, larger chunks (1500-2000 chars) preserve reasoning context. Semantic chunking (splitting on topic boundaries rather than character counts) can outperform fixed-size splitting by 15-25% on context recall — worth the extra implementation time.
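The token-to-character conversion in that heuristic uses the common rule of thumb of about 4 characters per token for English text (an approximation; exact counts depend on the tokenizer). A trivial helper makes it explicit:

```python
# Convert a per-chunk token budget to a character-based chunk_size,
# using the rough 4-chars-per-token rule for English (approximate).

def chars_for_token_budget(tokens: int, chars_per_token: float = 4.0) -> int:
    return int(tokens * chars_per_token)

# Allocating ~250 tokens per chunk gives ~1000-char chunks, matching
# the chunk_size=1000 default used in Step 3
print(chars_for_token_budget(250))  # → 1000
```

For exact counts, tokenize with tiktoken (already in the install list) instead of estimating.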
What are the cold start costs for AWS Lambda with a vector database?
Lambda cold starts for a Python RAG function add 800ms-2s depending on package size. Mitigation: use Lambda layers for heavy dependencies (LangChain, numpy), keep function package under 50MB, and set provisioned concurrency (1-2 instances, ~$15/month) for latency-critical paths. The vector database call (ChromaDB EFS or Pinecone) adds 50-300ms per query. Total P95 latency target: < 3 seconds for the full RAG cycle (embed query → retrieve → generate).
Can I run the full RAG pipeline locally without any API costs?
Yes. Use Ollama for local LLM inference (llama3.2 or mistral) and local embeddings (nomic-embed-text), plus ChromaDB as the vector store. All free. Run `ollama pull llama3.2` and `ollama pull nomic-embed-text`, then replace the OpenAI clients with OllamaEmbeddings and ChatOllama in LangChain. On an M2 MacBook Pro, expect roughly 15-20 tokens/sec from llama3.2 3B; larger models like llama3.1 8B run slower, and 70B-class models drop to a few tokens/sec. For the Docker deployment in this tutorial, add an Ollama service to the compose file and point your RAG service to it.