Talki Academy
Tutorial · 35 min read

RAG Pipeline End-to-End: Build, Evaluate, and Deploy (2026)

Build a complete Retrieval-Augmented Generation system from scratch — covering vector database setup with ChromaDB and Pinecone, advanced chunking strategies, retrieval quality testing, evaluation with RAGAS metrics, and deployment to Docker and AWS Lambda. Every step includes complete, runnable Python code you can adapt today.

By Talki Academy · Published April 9, 2026

What you'll build

  1. Environment setup and dependencies
  2. Vector database: ChromaDB (local) and Pinecone (cloud)
  3. Document loading and advanced chunking strategies
  4. Embedding models and vector indexing
  5. Retrieval chain and generation with LangChain
  6. Retrieval quality testing
  7. Evaluation with RAGAS (faithfulness, relevancy, precision, recall)
  8. Deployment: Docker Compose and AWS Lambda

Architecture Overview

A production RAG system has two distinct phases that run at different times:

Indexing Pipeline (runs once, or on document updates)

Raw Documents → Load → Clean → Chunk → Embed → Store in Vector DB

Query Pipeline (runs on every user request)

User Query → Embed → Retrieve Top-K Chunks → Build Prompt → LLM → Answer

The key insight is that embedding quality and chunk design are fixed at indexing time. A poorly chunked document cannot be rescued at query time — which is why this tutorial spends significant time on those early steps.
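Before any frameworks, the core of the query pipeline fits in a few lines. This dependency-free sketch uses toy three-dimensional vectors in place of real embeddings to show how "Retrieve Top-K Chunks" works under the hood:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

# Toy 3-dim "embeddings" for illustration only
chunks = ["refund policy", "shipping times", "password reset"]
vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
query = [0.9, 0.2, 0.1]  # pretend this embeds "how do refunds work?"

print(retrieve_top_k(query, vecs, chunks, k=1))  # → ['refund policy']
```

Production vector databases do the same comparison, but over an approximate nearest-neighbor index (HNSW in ChromaDB's case) so they don't have to scan every stored vector.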

Step 1: Environment Setup

We'll use Python 3.11+. Install dependencies in a virtual environment:

```bash
# Create isolated environment
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Core RAG stack
pip install langchain langchain-openai langchain-community langchain-chroma

# Document processing
pip install pypdf "unstructured[pdf]" python-docx

# Vector databases
pip install chromadb    # Local, open-source
pip install pinecone    # Cloud-managed (optional)

# Evaluation framework
pip install ragas datasets

# Deployment dependencies
pip install fastapi uvicorn mangum   # mangum = Lambda ASGI adapter

# Utilities
pip install python-dotenv tiktoken
```

Create a .env file for your credentials:

```bash
# .env
OPENAI_API_KEY=sk-...    # Or use Ollama for free local inference
PINECONE_API_KEY=...     # Only needed if using Pinecone

# For Ollama (free, local models):
# Install from https://ollama.ai, then:
#   ollama pull nomic-embed-text   # 768-dim embeddings
#   ollama pull llama3.2           # 3B model, fast
```

Step 2: Vector Database Setup

Option A: ChromaDB (Local / Docker)

ChromaDB is the best starting point — free, runs in-process or as a server, and needs zero cloud configuration. For local development, use embedded mode:

```python
# chroma_setup.py
import chromadb
from chromadb.config import Settings

# Embedded mode (single process, persisted to disk)
client = chromadb.PersistentClient(
    path="./chroma_data",
    settings=Settings(anonymized_telemetry=False),
)

# Create (or get existing) collection
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},  # cosine similarity for semantic search
)

print(f"Collection ready: {collection.name}")
print(f"Documents indexed: {collection.count()}")
```

For a persistent server (shared across processes or Docker containers):

```python
# Run ChromaDB as a standalone server:
#   docker run -p 8000:8000 chromadb/chroma

# Connect from Python:
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("documents")
```

Option B: Pinecone (Cloud, Production Scale)

Pinecone excels when you have millions of documents or need managed replication. Create a free account at pinecone.io, then:

```python
# pinecone_setup.py
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key")

# Create index (1536 dims = OpenAI text-embedding-3-small)
# For nomic-embed-text: dimension=768
if "rag-docs" not in pc.list_indexes().names():
    pc.create_index(
        name="rag-docs",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index("rag-docs")
print(index.describe_index_stats())
# Output: {'dimension': 1536, 'total_vector_count': 0, ...}
```
Cost note: Pinecone Serverless charges $0.096 per 1M reads and $2/GB/month storage. A 10,000-document knowledge base costs roughly $2-5/month. For <1M documents, ChromaDB on a $5/month VPS is cheaper.

Step 3: Document Loading and Chunking Strategy

Chunking is the most consequential design decision in a RAG system. Chunks that are too small lose context; chunks that are too large dilute retrieval precision.

Loading Multiple Document Formats

```python
# document_loader.py
from pathlib import Path

from langchain_community.document_loaders import (
    DirectoryLoader,
    PyPDFLoader,
    UnstructuredWordDocumentLoader,
    WebBaseLoader,
)

def load_documents(source_dir: str = "./docs") -> list:
    """Load all documents from a directory, auto-detecting format."""
    loaders = {
        "**/*.pdf": PyPDFLoader,
        "**/*.docx": UnstructuredWordDocumentLoader,
    }
    all_docs = []
    for pattern, loader_cls in loaders.items():
        loader = DirectoryLoader(
            source_dir,
            glob=pattern,
            loader_cls=loader_cls,
            show_progress=True,
        )
        docs = loader.load()
        all_docs.extend(docs)
        print(f"Loaded {len(docs)} pages from {pattern} files")

    # Add metadata for filtering later
    for doc in all_docs:
        doc.metadata["ingested_at"] = "2026-04-09"

    print(f"\nTotal: {len(all_docs)} document pages loaded")
    return all_docs

docs = load_documents("./docs")
# Output:
# Loaded 45 pages from **/*.pdf files
# Loaded 12 pages from **/*.docx files
# Total: 57 document pages loaded
```

Strategy 1: Recursive Character Splitting (Baseline)

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Good default for most document types
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Target chunk size in characters
    chunk_overlap=200,    # Overlap preserves context across boundaries
    length_function=len,
    separators=[
        "\n\n",  # Prefer splitting on paragraph breaks
        "\n",    # Then line breaks
        ". ",    # Then sentence boundaries
        " ",     # Then words
        "",      # Character fallback
    ],
)

chunks = splitter.split_documents(docs)
print(f"Split into {len(chunks)} chunks")
print(f"Avg chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
# Output:
# Split into 341 chunks
# Avg chunk size: 847 chars
```

Strategy 2: Semantic Chunking (Better Recall)

Semantic chunking splits on topic boundaries detected by embedding similarity, rather than at fixed character counts. Because related content stays together, it can improve context recall by 15-25% in benchmarks.

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # Split where sentence-embedding distance exceeds the threshold
    breakpoint_threshold_amount=95,          # 95th percentile of distances
)

semantic_chunks = semantic_splitter.split_documents(docs)
print(f"Semantic chunks: {len(semantic_chunks)}")
print(f"Avg chunk size: {sum(len(c.page_content) for c in semantic_chunks) // len(semantic_chunks)} chars")
# Output (chunks are variable size, topic-aligned):
# Semantic chunks: 198
# Avg chunk size: 1423 chars

# Trade-off: ~2x more tokens to embed, but much better retrieval quality
# Cost: ~$0.003 per 1M chars with text-embedding-3-small
```

Chunk Size Comparison

| Document type | Recommended chunk size | Overlap | Splitter |
|---|---|---|---|
| Technical docs / APIs | 500-800 chars | 100-150 | Recursive |
| Legal / contracts | 1500-2000 chars | 300-400 | Recursive (sentence) |
| Research papers | Topic-based | N/A | Semantic |
| Customer support FAQs | One Q&A per chunk | 0 | Custom (split on Q:) |
| Code files | Function / class | 0-50 | RecursiveCharacter (code) |
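The "Custom (split on Q:)" row is simple enough to implement directly. This is a hypothetical sketch rather than a LangChain splitter class — it keeps each Q&A pair intact as one chunk:

```python
def split_faq(text: str) -> list[str]:
    """Split FAQ text into one chunk per Q&A pair (chunks start at 'Q:')."""
    chunks = []
    current: list[str] = []
    for line in text.splitlines():
        if line.startswith("Q:") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

faq = """Q: How do I get a refund?
A: Email support within 30 days.
Q: Do you ship to Europe?
A: Yes, 7-14 business days."""

for chunk in split_faq(faq):
    print(repr(chunk))
# Each printed chunk is one complete Q&A pair
```

Because every chunk is a self-contained question and answer, overlap is unnecessary — which is why the table sets it to 0 for FAQs.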

Step 4: Embedding and Vector Indexing

```python
# indexer.py
import os

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Initialize embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # 1536-dim, $0.02 per 1M tokens
    # Alternative: text-embedding-3-large (3072-dim, higher accuracy, 5x cost)
)

# Build (or load) vector store
PERSIST_DIR = "./chroma_data"

if os.path.exists(PERSIST_DIR) and os.listdir(PERSIST_DIR):
    print("Loading existing vector store...")
    vectorstore = Chroma(
        persist_directory=PERSIST_DIR,
        embedding_function=embeddings,
        collection_name="documents",
    )
else:
    print(f"Indexing {len(chunks)} chunks...")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=PERSIST_DIR,
        collection_name="documents",
        collection_metadata={"hnsw:space": "cosine"},
    )
    print("Indexing complete.")

print(f"Vector store ready: {vectorstore._collection.count()} vectors")
# Output: Vector store ready: 341 vectors
```

Using free local embeddings with Ollama:

```python
# First pull the model:
#   ollama pull nomic-embed-text
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")  # 768-dim, free, local
# Rest of the indexing code is identical
```

Step 5: Retrieval Chain and Generation

```python
# rag_chain.py
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Retriever: MMR (Maximal Marginal Relevance) reduces duplicate chunks
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Balances relevance + diversity
    search_kwargs={
        "k": 5,              # Return 5 chunks
        "fetch_k": 20,       # Consider 20 candidates, pick diverse 5
        "lambda_mult": 0.7,  # 0=max diversity, 1=max relevance
    },
)

SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using ONLY the context below.
If the context does not contain enough information, say "I don't have enough information to answer this."
Do not make up information or draw from outside knowledge.

Context:
{context}"""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    ("human", "{question}"),
])

def format_docs(docs: list) -> str:
    """Format retrieved documents with source attribution."""
    parts = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "")
        label = f"[{i}] {source}" + (f" p.{page}" if page else "")
        parts.append(f"{label}\n{doc.page_content}")
    return "\n\n---\n\n".join(parts)

# Chain with source tracking
rag_chain_with_sources = RunnableParallel(
    answer=(
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    ),
    sources=retriever,
)

# Query
result = rag_chain_with_sources.invoke("What is the refund policy?")
print(f"Answer:\n{result['answer']}\n")
print("Sources:")
for doc in result["sources"]:
    print(f"  - {doc.metadata.get('source')} (p.{doc.metadata.get('page', '?')})")

# Expected output:
# Answer:
# According to the refund policy section, customers may request a full refund
# within 30 days of purchase if the product is unused and in original condition.
#
# Sources:
#   - terms_and_conditions.pdf (p.4)
#   - faq.pdf (p.12)
```

Step 6: Retrieval Quality Testing

Before evaluating with RAGAS, manually test your retriever to catch obvious configuration problems. This takes 10 minutes and catches 80% of issues.

```python
# retrieval_test.py
from typing import NamedTuple

class RetrievalTest(NamedTuple):
    query: str
    expected_keywords: list[str]  # Words that must appear in retrieved chunks
    expected_k: int = 3           # Minimum chunks expected with relevant content

RETRIEVAL_TESTS = [
    RetrievalTest(
        query="What is the refund policy?",
        expected_keywords=["refund", "return", "days"],
        expected_k=2,
    ),
    RetrievalTest(
        query="How do I reset my password?",
        expected_keywords=["password", "reset", "email"],
        expected_k=1,
    ),
    RetrievalTest(
        query="What payment methods are accepted?",
        expected_keywords=["payment", "credit card", "paypal"],
        expected_k=2,
    ),
]

def run_retrieval_tests(retriever, tests: list[RetrievalTest]) -> dict:
    """Run retrieval tests and report pass/fail."""
    results = {"passed": 0, "failed": 0, "details": []}
    for test in tests:
        docs = retriever.invoke(test.query)
        combined_text = " ".join(d.page_content.lower() for d in docs)

        # Check all expected keywords appear in retrieved content
        keywords_found = {kw: kw.lower() in combined_text for kw in test.expected_keywords}
        all_found = all(keywords_found.values())
        has_enough_docs = len(docs) >= test.expected_k

        passed = all_found and has_enough_docs
        results["passed" if passed else "failed"] += 1
        results["details"].append({
            "query": test.query,
            "passed": passed,
            "chunks_retrieved": len(docs),
            "keywords_found": keywords_found,
        })
    return results

report = run_retrieval_tests(retriever, RETRIEVAL_TESTS)

for detail in report["details"]:
    status = "PASS" if detail["passed"] else "FAIL"
    print(f"[{status}] {detail['query']}")
    if not detail["passed"]:
        missing = [k for k, v in detail["keywords_found"].items() if not v]
        print(f"  Missing keywords: {missing}")
        print(f"  Chunks retrieved: {detail['chunks_retrieved']}")

print(f"\nResults: {report['passed']}/{len(RETRIEVAL_TESTS)} tests passed")
# Output:
# [PASS] What is the refund policy?
# [PASS] How do I reset my password?
# [FAIL] What payment methods are accepted?
#   Missing keywords: ['paypal']
#   Chunks retrieved: 5
#
# Results: 2/3 tests passed
```

If a test fails, debug in this order: (1) verify the keyword exists in your documents, (2) increase k, (3) try a different query phrasing, (4) check chunk boundaries aren't splitting the keyword away from context.
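For point (4), it helps to see exactly which chunk holds a keyword and whether it sits near a chunk boundary, where its surrounding context may have been split away. A small debugging sketch — the `near_boundary` heuristic and the `edge_chars` threshold are arbitrary choices, and plain strings stand in for Document chunks:

```python
def find_keyword_in_chunks(chunks, keyword: str, edge_chars: int = 100):
    """Report chunks containing `keyword` and flag near-boundary hits."""
    hits = []
    for i, chunk in enumerate(chunks):
        # Accept LangChain Documents or plain strings
        text = chunk.page_content if hasattr(chunk, "page_content") else chunk
        pos = text.lower().find(keyword.lower())
        if pos == -1:
            continue
        near_edge = pos < edge_chars or pos > len(text) - edge_chars
        hits.append({"chunk": i, "position": pos, "near_boundary": near_edge})
    return hits

# Example with plain strings standing in for Document chunks
chunks = ["...long passage about shipping... we accept PayPal", "credit card details..."]
print(find_keyword_in_chunks(chunks, "paypal", edge_chars=20))
# A hit with near_boundary=True suggests the keyword's context may have
# landed in the neighboring chunk
```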

Step 7: Evaluation with RAGAS

RAGAS (Retrieval Augmented Generation Assessment) measures four dimensions that matter in production. Unlike manual testing, RAGAS uses an LLM-as-judge approach to score at scale.

| Metric | What it measures | Production target |
|---|---|---|
| Faithfulness | Answer is grounded in retrieved context (no hallucination) | > 0.85 |
| Answer Relevancy | Answer actually addresses the question asked | > 0.80 |
| Context Precision | Retrieved chunks are relevant (no noisy chunks) | > 0.75 |
| Context Recall | All relevant information was retrieved (none missed) | > 0.70 |

Building a Test Dataset

```python
# evaluation_dataset.py
from datasets import Dataset

# Build Q&A pairs from your documents.
# Ground truth answers come from the source documents.
# The "contexts" and "answer" columns are added later, during evaluation —
# Dataset.from_dict requires every column to have the same length.
evaluation_data = {
    "question": [
        "What is the refund policy for digital products?",
        "How long does shipping take to Europe?",
        "Can I use the product commercially?",
        "What languages is customer support available in?",
        "Is there a free trial period?",
    ],
    "ground_truth": [
        "Digital products are non-refundable except in cases of technical issues verified by our support team.",
        "Standard shipping to Europe takes 7-14 business days. Express shipping takes 3-5 business days.",
        "Yes, commercial use is permitted under the Professional and Enterprise license tiers.",
        "Customer support is available in English, French, Spanish, and German.",
        "Yes, all plans include a 14-day free trial with full feature access and no credit card required.",
    ],
}

eval_dataset = Dataset.from_dict(evaluation_data)
print(f"Evaluation dataset: {len(eval_dataset)} questions")
```

Running RAGAS Evaluation

```python
# run_evaluation.py
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Generate answers and collect contexts for each question
def prepare_eval_dataset(dataset, rag_chain):
    """Run RAG over test questions to populate contexts and answers."""
    contexts_list = []
    answers_list = []
    for question in dataset["question"]:
        # One call returns both the answer and the retrieved source documents
        result = rag_chain.invoke(question)
        contexts_list.append([doc.page_content for doc in result["sources"]])
        answers_list.append(result["answer"])

    dataset = dataset.add_column("contexts", contexts_list)
    dataset = dataset.add_column("answer", answers_list)
    return dataset

# Prepare dataset with generated answers
eval_ready = prepare_eval_dataset(eval_dataset, rag_chain_with_sources)

# Run RAGAS evaluation
ragas_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
ragas_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

results = evaluate(
    dataset=eval_ready,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ragas_llm,
    embeddings=ragas_embeddings,
)

print("\n=== RAGAS Evaluation Results ===")
print(f"Faithfulness:      {results['faithfulness']:.3f} (target: >0.85)")
print(f"Answer Relevancy:  {results['answer_relevancy']:.3f} (target: >0.80)")
print(f"Context Precision: {results['context_precision']:.3f} (target: >0.75)")
print(f"Context Recall:    {results['context_recall']:.3f} (target: >0.70)")

# Typical output for a well-tuned system:
# === RAGAS Evaluation Results ===
# Faithfulness:      0.912 (target: >0.85)
# Answer Relevancy:  0.847 (target: >0.80)
# Context Precision: 0.783 (target: >0.75)
# Context Recall:    0.741 (target: >0.70)
```

Diagnosing and Improving Low Scores

```python
# diagnosis.py
def diagnose_ragas_failures(results, threshold=0.75):
    """Print actionable remediation for each failing metric."""
    metrics = {
        "faithfulness": {
            "score": results["faithfulness"],
            "fixes": [
                "Tighten the system prompt: 'Answer ONLY using the context. Never add information.'",
                "Reduce temperature to 0 for deterministic, grounded answers",
                "Add a post-processing step to verify every claim appears in context",
            ],
        },
        "answer_relevancy": {
            "score": results["answer_relevancy"],
            "fixes": [
                "Improve query rewriting — add a step to rephrase ambiguous questions",
                "Adjust prompt to require answering the specific question asked",
                "Check if low-relevancy answers are caused by off-topic retrieved chunks",
            ],
        },
        "context_precision": {
            "score": results["context_precision"],
            "fixes": [
                "Reduce k (retrieved chunks) — fewer but better chunks improve precision",
                "Add metadata filters to narrow search scope",
                "Try MMR search_type to reduce duplicate/noisy chunks",
                "Switch to hybrid search (BM25 + semantic) for keyword-heavy queries",
            ],
        },
        "context_recall": {
            "score": results["context_recall"],
            "fixes": [
                "Increase k to retrieve more candidate chunks",
                "Improve chunking — large chunks may split relevant content",
                "Use semantic chunking to preserve topic boundaries",
                "Add query expansion (generate multiple phrasings of the query)",
            ],
        },
    }

    print("\n=== Diagnosis Report ===")
    for metric, data in metrics.items():
        if data["score"] < threshold:
            print(f"\nFAIL: {metric} = {data['score']:.3f}")
            print("Recommended fixes:")
            for fix in data["fixes"]:
                print(f"  • {fix}")

diagnose_ragas_failures(results)
```

Step 8: Deployment

Option A: Docker Compose (Local / VPS)

Package the RAG API as a FastAPI service alongside ChromaDB. This runs identically in development, on a VPS, or in a container orchestrator.

```python
# app/main.py
import os
import time

import chromadb
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

app = FastAPI(title="RAG API", version="1.0.0")

class QueryRequest(BaseModel):
    question: str
    k: int = 5

class QueryResponse(BaseModel):
    answer: str
    sources: list[dict]
    latency_ms: float

# Initialize components on startup
@app.on_event("startup")
async def startup():
    global retriever, chain
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    chroma_client = chromadb.HttpClient(
        host=os.getenv("CHROMA_HOST", "chroma"),  # Docker service name
        port=int(os.getenv("CHROMA_PORT", "8000")),
    )
    vectorstore = Chroma(
        client=chroma_client,
        collection_name="documents",
        embedding_function=embeddings,
    )
    retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5})

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using ONLY the context below.\n\nContext:\n{context}"),
        ("human", "{question}"),
    ])
    chain = (
        {
            "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )

@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
    start = time.time()
    try:
        docs = retriever.invoke(request.question)
        answer = chain.invoke(request.question)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    return QueryResponse(
        answer=answer,
        sources=[
            {"source": d.metadata.get("source", ""), "page": d.metadata.get("page")}
            for d in docs
        ],
        latency_ms=(time.time() - start) * 1000,
    )

@app.get("/health")
async def health():
    return {"status": "ok"}
```
```yaml
# docker-compose.yml
version: "3.9"

services:
  chroma:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER=chromadb.auth.token.TokenConfigServerAuthCredentialsProvider
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/heartbeat"]
      interval: 10s
      timeout: 5s
      retries: 3

  rag_api:
    build: .
    ports:
      - "8080:8080"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - CHROMA_HOST=chroma
      - CHROMA_PORT=8000
    depends_on:
      chroma:
        condition: service_healthy
    command: uvicorn app.main:app --host 0.0.0.0 --port 8080

volumes:
  chroma_data:
```
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app/ ./app/

# Start API
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
```
```bash
# Build and run
docker-compose up --build

# Test
curl -X POST http://localhost:8080/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the refund policy?"}'

# Response:
# {
#   "answer": "Refunds are available within 30 days of purchase...",
#   "sources": [{"source": "terms.pdf", "page": 4}],
#   "latency_ms": 1247.3
# }
```

Option B: AWS Lambda (Serverless)

For serverless deployment, use Mangum to adapt FastAPI to Lambda's event format, and Pinecone (or ChromaDB on EFS) as the vector store. Lambda removes server management at the cost of cold starts.

```python
# lambda_handler.py
from mangum import Mangum

from app.main import app  # FastAPI app from above

# Mangum wraps FastAPI for Lambda + API Gateway
handler = Mangum(app, lifespan="off")

# Deployment steps:
# 1. Package dependencies into a Lambda layer or container image
# 2. Set Lambda environment variables: OPENAI_API_KEY, PINECONE_API_KEY
# 3. Connect to Pinecone instead of ChromaDB (ChromaDB on Lambda is complex)
# 4. Set memory to 1024MB minimum (vector operations need RAM)
# 5. Set timeout to 30 seconds (RAG queries can be slow)
```
```yaml
# serverless.yml (Serverless Framework)
service: rag-api

provider:
  name: aws
  runtime: python3.11
  region: eu-west-1
  memorySize: 1024  # MB — vector operations need RAM
  timeout: 30       # Seconds — allow for cold start + LLM generation
  environment:
    OPENAI_API_KEY: ${env:OPENAI_API_KEY}
    PINECONE_API_KEY: ${env:PINECONE_API_KEY}
    VECTOR_STORE: pinecone  # Use Pinecone for serverless (no persistent FS)

functions:
  api:
    handler: lambda_handler.handler
    events:
      - httpApi:
          path: /{proxy+}
          method: ANY
    layers:
      - ${cf:rag-dependencies-layer.LambdaLayerArn}

# Deploy:
#   npm install -g serverless
#   serverless deploy --stage prod
```
Lambda cost estimate: 1,000 daily RAG queries × 30s × 1024MB = ~$2/month for compute. Pinecone adds ~$2/month for a small index. Total: ~$4/month for a production-ready serverless RAG API serving 30K queries/month.

Performance Optimization Checklist

  • Cache embeddings: Use CacheBackedEmbeddings with Redis to avoid re-embedding identical queries — saves 60-80% on embedding API costs for production traffic
  • Async retrieval: Use retriever.ainvoke() and llm.ainvoke() for non-blocking I/O in FastAPI — supports 3-5x more concurrent requests on the same hardware
  • Batch indexing: When indexing >10K documents, use vectorstore.add_documents() in batches of 100 to avoid rate limits
  • Reduce k first: Lower k (retrieved chunks) before any other optimization — going from k=10 to k=4 halves prompt tokens and typically improves precision
  • Use smaller generation models: gpt-4o-mini costs 30x less than gpt-4o with 85-90% of the answer quality for factual retrieval tasks
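The caching idea behind `CacheBackedEmbeddings` is simple to illustrate without LangChain: key each text by a content hash and only call the embedding API on a miss. A conceptual sketch — `fake_embed` stands in for a real embedding call, and a production store would be Redis rather than a dict:

```python
import hashlib

class EmbeddingCache:
    """Content-addressed cache: embed each unique text at most once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}
        self.api_calls = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.store:
            self.api_calls += 1  # cache miss → real API call
            self.store[key] = self.embed_fn(text)
        return self.store[key]

def fake_embed(text: str) -> list[float]:
    return [float(len(text))]  # stand-in for a real embedding API

cache = EmbeddingCache(fake_embed)
for query in ["refund policy", "refund policy", "shipping times"]:
    cache.embed(query)
print(f"API calls: {cache.api_calls} for 3 queries")  # → API calls: 2 for 3 queries
```

The repeated query hits the cache, which is exactly where the savings on production traffic come from.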

Next Steps

  • Hybrid search: Combine BM25 keyword search with semantic search using EnsembleRetriever — improves precision on exact-match queries by 20-30%
  • Reranking: Add a Cohere or cross-encoder reranker after retrieval to re-score chunks — consistently improves answer quality at ~$0.001 per query extra cost
  • Multi-modal RAG: Extend to images and tables using GPT-4o vision or Unstructured's table extraction
  • Agentic RAG: Use LangGraph to build a retrieval agent that decides when to search, what to search for, and when it has enough context
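The hybrid-search idea from the first bullet can be prototyped in a few lines: take a keyword ranking and a semantic ranking, then merge them with reciprocal rank fusion, the same principle behind LangChain's `EnsembleRetriever`. A self-contained sketch with hypothetical, precomputed rankings:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked high anywhere score well."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a BM25 retriever and a vector retriever
bm25_ranking = ["doc_refund", "doc_terms", "doc_faq"]
vector_ranking = ["doc_faq", "doc_refund", "doc_shipping"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # → ['doc_refund', 'doc_faq', 'doc_terms', 'doc_shipping']
```

A document ranked well by both retrievers (`doc_refund` here) rises to the top, which is why hybrid search helps on queries that mix exact keywords with semantic intent.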


Frequently Asked Questions

What is the difference between ChromaDB and Pinecone for RAG?

ChromaDB is a free, open-source vector database that runs locally (or in Docker). It's ideal for development, small-to-medium datasets (<10M vectors), and privacy-sensitive deployments. Pinecone is a managed cloud service with automatic scaling, serverless billing (~$0.096 per 1M reads), and built-in replication — best for production systems with millions of documents or teams without infrastructure expertise. You can build with ChromaDB and migrate to Pinecone later without changing your LangChain retriever code.

What RAGAS scores should I target before going to production?

Industry benchmarks for production RAG systems: Faithfulness > 0.85 (LLM's answer is grounded in retrieved context), Answer Relevancy > 0.80 (response addresses the question), Context Precision > 0.75 (retrieved chunks are relevant), Context Recall > 0.70 (enough relevant context is retrieved). If any score is below threshold, diagnose the specific failure: low context recall → increase k or improve embeddings; low faithfulness → improve the system prompt to reduce hallucination; low answer relevancy → refine query rewriting.

How do I choose chunk size? Is there a formula?

No universal formula, but a practical starting heuristic: set chunk size to roughly 75-85% of the token budget you allocate per retrieved chunk in the LLM's context window, converted to characters (about 4 characters per token in English). For most retrieval tasks, start with 1000 characters / 200 overlap and benchmark. If your queries are short (< 5 words), smaller chunks (500 chars) retrieve more precisely. If queries are complex multi-sentence questions, larger chunks (1500-2000) preserve reasoning context. Semantic chunking (splitting on topic boundaries rather than character counts) consistently outperforms fixed-size splitting by 15-25% on context recall — worth the extra implementation time.
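To apply the heuristic, measure your chunks in approximate tokens as well as characters before indexing. A small helper using the rough 4-characters-per-token rule for English (swap in `tiktoken` for exact counts):

```python
def chunk_stats(chunks: list[str], chars_per_token: float = 4.0) -> dict:
    """Summarize chunk sizes in characters and approximate tokens."""
    sizes = [len(c) for c in chunks]
    avg_chars = sum(sizes) / len(sizes)
    return {
        "count": len(chunks),
        "avg_chars": round(avg_chars),
        "max_chars": max(sizes),
        "approx_avg_tokens": round(avg_chars / chars_per_token),
    }

# Dummy chunks standing in for real splitter output
chunks = ["a" * 800, "b" * 1000, "c" * 1200]
print(chunk_stats(chunks))
# → {'count': 3, 'avg_chars': 1000, 'max_chars': 1200, 'approx_avg_tokens': 250}
```

If `approx_avg_tokens` blows past the per-chunk budget you planned, shrink `chunk_size` before paying to embed the collection.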

What are the cold start costs for AWS Lambda with a vector database?

Lambda cold starts for a Python RAG function add 800ms-2s depending on package size. Mitigation: use Lambda layers for heavy dependencies (LangChain, numpy), keep function package under 50MB, and set provisioned concurrency (1-2 instances, ~$15/month) for latency-critical paths. The vector database call (ChromaDB EFS or Pinecone) adds 50-300ms per query. Total P95 latency target: < 3 seconds for the full RAG cycle (embed query → retrieve → generate).

Can I run the full RAG pipeline locally without any API costs?

Yes. Use Ollama for local LLM inference (llama3.2 or mistral) and local embeddings (nomic-embed-text), plus ChromaDB as the vector store. All free. Run `ollama pull llama3.2` and `ollama pull nomic-embed-text`, then replace the OpenAI clients with OllamaEmbeddings and ChatOllama in LangChain. On an M2 MacBook Pro, expect roughly 15-20 tokens/sec for the 3B llama3.2 model; a 70B model (such as llama3.3) drops to 2-4 tokens/sec and needs far more RAM. For the Docker deployment in this tutorial, add an Ollama service to the compose file and point your RAG service to it.
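As a sketch of that last step, an Ollama service can be added to the Compose stack like this (the service names and the `OLLAMA_BASE_URL` variable are assumptions to adapt to your own files):

```yaml
# docker-compose.override.yml — add local Ollama to the stack
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama   # persist pulled models across restarts

  rag_api:
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434  # point ChatOllama/OllamaEmbeddings here
    depends_on:
      - ollama

volumes:
  ollama_models:
```

After `docker compose up`, pull the models once inside the container (`docker compose exec ollama ollama pull llama3.2`), and the API no longer needs any external keys.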

Build Production RAG Systems

Professional training for developers building RAG pipelines, LLM applications, and AI agents.
