Architecture Overview
A production RAG system has two distinct phases that run at different times:
Indexing Pipeline (runs once, or on document updates)
Raw Documents → Load → Clean → Chunk → Embed → Store in Vector DB
Query Pipeline (runs on every user request)
User Query → Embed → Retrieve Top-K Chunks → Build Prompt → LLM → Answer
The key insight is that embedding quality and chunk design are fixed at indexing time. A poorly chunked document cannot be rescued at query time — which is why this tutorial spends significant time on those early steps.
Step 1: Environment Setup
We'll use Python 3.11+. Install dependencies in a virtual environment:
# Create isolated environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Core RAG stack
pip install langchain langchain-openai langchain-community langchain-chroma
# Document processing
pip install pypdf unstructured[pdf] python-docx
# Vector databases
pip install chromadb # Local, open-source
pip install pinecone # Cloud-managed (optional)
# Evaluation framework
pip install ragas datasets
# Deployment dependencies
pip install fastapi uvicorn mangum # mangum = Lambda ASGI adapter
# Utilities
pip install python-dotenv tiktoken
Create a .env file for your credentials:
# .env
OPENAI_API_KEY=sk-... # Or use Ollama for free local inference
PINECONE_API_KEY=... # Only needed if using Pinecone
# For Ollama (free, local models):
# Install from https://ollama.ai, then:
# ollama pull nomic-embed-text (768-dim embeddings)
# ollama pull llama3.2 (3B model, fast)
Step 2: Vector Database Setup
Option A: ChromaDB (Local / Docker)
ChromaDB is the best starting point — free, runs in-process or as a server, and needs zero cloud configuration. For local development, use embedded mode:
# chroma_setup.py
import chromadb
from chromadb.config import Settings
# Embedded mode (single process, persisted to disk)
client = chromadb.PersistentClient(
path="./chroma_data",
settings=Settings(anonymized_telemetry=False)
)
# Create (or get existing) collection
collection = client.get_or_create_collection(
name="documents",
metadata={"hnsw:space": "cosine"} # cosine similarity for semantic search
)
print(f"Collection ready: {collection.name}")
print(f"Documents indexed: {collection.count()}")
For a persistent server (shared across processes or Docker containers):
# Run ChromaDB as a standalone server:
# docker run -p 8000:8000 chromadb/chroma
# Connect from Python:
import chromadb
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("documents")
Option B: Pinecone (Cloud, Production Scale)
Pinecone excels when you have millions of documents or need managed replication. Create a free account at pinecone.io, then:
# pinecone_setup.py
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-api-key")
# Create index (1536 dims = OpenAI text-embedding-3-small)
# For nomic-embed-text: dimension=768
if "rag-docs" not in pc.list_indexes().names():
pc.create_index(
name="rag-docs",
dimension=1536,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("rag-docs")
print(index.describe_index_stats())
# Output: {'dimension': 1536, 'total_vector_count': 0, ...}
Cost note: Pinecone Serverless charges $0.096 per 1M reads and $2/GB/month storage. A 10,000-document knowledge base costs roughly $2-5/month. For <1M documents, ChromaDB on a $5/month VPS is cheaper.
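To make the cost note concrete, here is a back-of-envelope model using the prices quoted above. The traffic and storage numbers in the usage example are hypothetical, and pricing changes, so treat this as a sketch rather than a budget:

```python
# Back-of-envelope Pinecone Serverless cost model using the prices quoted
# above ($0.096 per 1M reads, $2/GB/month storage). Illustrative only;
# check current pricing before budgeting.

def pinecone_monthly_cost(reads_per_month: int, storage_gb: float) -> float:
    """Estimate monthly Pinecone Serverless cost in USD."""
    read_cost = reads_per_month / 1_000_000 * 0.096
    storage_cost = storage_gb * 2.00
    return round(read_cost + storage_cost, 2)

# Hypothetical workload: 100K queries/month against ~0.5 GB of vectors
print(pinecone_monthly_cost(100_000, 0.5))  # → 1.01
```

At this scale, storage dominates reads, which is why the flat-fee VPS option stays competitive until traffic grows.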
Step 3: Document Loading and Chunking Strategy
Chunking is the most consequential design decision in a RAG system. Chunks that are too small lose context; chunks too large dilute retrieval precision.
Loading Multiple Document Formats
# document_loader.py
from langchain_community.document_loaders import (
PyPDFLoader,
UnstructuredWordDocumentLoader,
WebBaseLoader,
DirectoryLoader,
)
from pathlib import Path
def load_documents(source_dir: str = "./docs") -> list:
"""Load all documents from a directory, auto-detecting format."""
loaders = {
"**/*.pdf": PyPDFLoader,
"**/*.docx": UnstructuredWordDocumentLoader,
}
all_docs = []
for pattern, loader_cls in loaders.items():
loader = DirectoryLoader(
source_dir,
glob=pattern,
loader_cls=loader_cls,
show_progress=True,
)
docs = loader.load()
all_docs.extend(docs)
print(f"Loaded {len(docs)} pages from {pattern} files")
# Add metadata for filtering later
for doc in all_docs:
doc.metadata["ingested_at"] = "2026-04-09"
print(f"\nTotal: {len(all_docs)} document pages loaded")
return all_docs
docs = load_documents("./docs")
# Output:
# Loaded 45 pages from **/*.pdf files
# Loaded 12 pages from **/*.docx files
# Total: 57 document pages loaded
Strategy 1: Recursive Character Splitting (Baseline)
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Good default for most document types
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Target chunk size in characters
chunk_overlap=200, # Overlap preserves context across boundaries
length_function=len,
separators=[
"\n\n", # Prefer splitting on paragraph breaks
"\n", # Then line breaks
". ", # Then sentence boundaries
" ", # Then words
"", # Character fallback
],
)
chunks = splitter.split_documents(docs)
print(f"Split into {len(chunks)} chunks")
print(f"Avg chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
# Output:
# Split into 341 chunks
# Avg chunk size: 847 chars
Strategy 2: Semantic Chunking (Better Recall)
Semantic chunking splits on topic boundaries detected by embedding similarity, rather than fixed character counts. In some benchmarks it improves context recall by 15-25% because related content stays together.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
embeddings,
    breakpoint_threshold_type="percentile",  # Split where embedding distance between segments
    breakpoint_threshold_amount=95,          # exceeds the 95th percentile of all distances
)
semantic_chunks = semantic_splitter.split_documents(docs)
print(f"Semantic chunks: {len(semantic_chunks)}")
print(f"Avg chunk size: {sum(len(c.page_content) for c in semantic_chunks) // len(semantic_chunks)} chars")
# Output (chunks are variable size, topic-aligned):
# Semantic chunks: 198
# Avg chunk size: 1423 chars
# Trade-off: ~2x more tokens to embed, but much better retrieval quality
# Cost: ~$0.005 per 1M chars with text-embedding-3-small ($0.02 per 1M tokens)
Chunk Size Comparison
| Document Type | Recommended Chunk Size | Overlap | Splitter |
|---|---|---|---|
| Technical docs / APIs | 500-800 chars | 100-150 | Recursive |
| Legal / contracts | 1500-2000 chars | 300-400 | Recursive (sentence) |
| Research papers | Topic-based | N/A | Semantic |
| Customer support FAQs | One Q&A per chunk | 0 | Custom (split on Q:) |
| Code files | Function / class | 0-50 | RecursiveCharacter (code) |
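The "Custom (split on Q:)" row in the table can be sketched in a few lines, assuming FAQ files format entries as "Q: ..." / "A: ..." blocks. The sample text here is invented for illustration:

```python
# Minimal custom splitter for FAQ documents: one Q&A pair per chunk.
# Assumes the source formats entries as "Q: ..." / "A: ..." blocks.
import re

def split_faq(text: str) -> list[str]:
    """Split FAQ text into one chunk per question/answer pair."""
    # Split at every line that starts with "Q:", keeping the delimiter
    parts = re.split(r"(?m)^(?=Q:)", text)
    return [p.strip() for p in parts if p.strip()]

faq = """Q: What is the refund window?
A: 30 days from purchase.
Q: Do you ship to Europe?
A: Yes, 7-14 business days."""

chunks = split_faq(faq)
print(len(chunks))  # → 2
```

Each chunk keeps its question and answer together, so a query matching the question text retrieves the answer with it.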
Step 4: Embedding and Vector Indexing
# indexer.py
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
import os
# Initialize embedding model
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small", # 1536-dim, $0.02 per 1M tokens
# Alternative: text-embedding-3-large (3072-dim, higher accuracy, 5x cost)
)
# Build (or load) vector store
PERSIST_DIR = "./chroma_data"
if os.path.exists(PERSIST_DIR) and os.listdir(PERSIST_DIR):
print("Loading existing vector store...")
vectorstore = Chroma(
persist_directory=PERSIST_DIR,
embedding_function=embeddings,
collection_name="documents",
)
else:
print(f"Indexing {len(chunks)} chunks...")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory=PERSIST_DIR,
collection_name="documents",
collection_metadata={"hnsw:space": "cosine"},
)
print("Indexing complete.")
print(f"Vector store ready: {vectorstore._collection.count()} vectors")
# Output: Vector store ready: 341 vectors
Using free local embeddings with Ollama:
# First pull the model:
# ollama pull nomic-embed-text
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text") # 768-dim, free, local
# Rest of the indexing code is identical
Step 5: Retrieval Chain and Generation
# rag_chain.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Retriever: MMR (Maximal Marginal Relevance) reduces duplicate chunks
retriever = vectorstore.as_retriever(
search_type="mmr", # Balances relevance + diversity
search_kwargs={
"k": 5, # Return 5 chunks
"fetch_k": 20, # Consider 20 candidates, pick diverse 5
"lambda_mult": 0.7, # 0=max diversity, 1=max relevance
},
)
SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using ONLY the context below.
If the context does not contain enough information, say "I don't have enough information to answer this."
Do not make up information or draw from outside knowledge.
Context:
{context}"""
prompt = ChatPromptTemplate.from_messages([
("system", SYSTEM_PROMPT),
("human", "{question}"),
])
def format_docs(docs: list) -> str:
"""Format retrieved documents with source attribution."""
parts = []
for i, doc in enumerate(docs, 1):
source = doc.metadata.get("source", "unknown")
page = doc.metadata.get("page", "")
label = f"[{i}] {source}" + (f" p.{page}" if page else "")
parts.append(f"{label}\n{doc.page_content}")
return "\n\n---\n\n".join(parts)
# Chain with source tracking
rag_chain_with_sources = RunnableParallel(
answer=(
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
),
sources=(retriever),
)
# Query
result = rag_chain_with_sources.invoke("What is the refund policy?")
print(f"Answer:\n{result['answer']}\n")
print("Sources:")
for doc in result["sources"]:
print(f" - {doc.metadata.get('source')} (p.{doc.metadata.get('page', '?')})")
# Expected output:
# Answer:
# According to the refund policy section, customers may request a full refund
# within 30 days of purchase if the product is unused and in original condition.
#
# Sources:
# - terms_and_conditions.pdf (p.4)
# - faq.pdf (p.12)
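The MMR retriever configured above is easier to reason about with a toy version: at each step, pick the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. This is a sketch of the idea, not LangChain's implementation, and the similarity numbers are invented:

```python
# Plain-Python sketch of Maximal Marginal Relevance: greedily pick the
# candidate that balances similarity to the query against similarity to
# chunks already selected. Toy scores stand in for real embeddings.

def mmr_select(query_sim, pairwise_sim, k=2, lambda_mult=0.7):
    """query_sim[i]: sim(query, doc i); pairwise_sim[i][j]: sim(doc i, doc j)."""
    selected: list[int] = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy = similarity to the closest already-selected doc
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Doc 1 is nearly a duplicate of doc 0; doc 2 is distinct
query_sim = [0.90, 0.88, 0.75]
pairwise = [[1.0, 0.95, 0.2], [0.95, 1.0, 0.2], [0.2, 0.2, 1.0]]
print(mmr_select(query_sim, pairwise, k=2))  # → [0, 2]
```

With `lambda_mult=1.0` the same call degenerates to pure relevance ranking and returns the near-duplicate pair instead; that is the dial `lambda_mult` controls.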
Step 6: Retrieval Quality Testing
Before evaluating with RAGAS, manually test your retriever to catch obvious configuration problems. Ten minutes here catches most of the issues that would otherwise surface as puzzling RAGAS scores.
# retrieval_test.py
from typing import NamedTuple
class RetrievalTest(NamedTuple):
query: str
expected_keywords: list[str] # Words that must appear in retrieved chunks
    expected_k: int = 3  # Minimum number of chunks that must be retrieved
RETRIEVAL_TESTS = [
RetrievalTest(
query="What is the refund policy?",
expected_keywords=["refund", "return", "days"],
expected_k=2,
),
RetrievalTest(
query="How do I reset my password?",
expected_keywords=["password", "reset", "email"],
expected_k=1,
),
RetrievalTest(
query="What payment methods are accepted?",
expected_keywords=["payment", "credit card", "paypal"],
expected_k=2,
),
]
def run_retrieval_tests(retriever, tests: list[RetrievalTest]) -> dict:
"""Run retrieval tests and report pass/fail."""
results = {"passed": 0, "failed": 0, "details": []}
for test in tests:
docs = retriever.invoke(test.query)
combined_text = " ".join(d.page_content.lower() for d in docs)
# Check all expected keywords appear in retrieved content
keywords_found = {kw: kw.lower() in combined_text for kw in test.expected_keywords}
all_found = all(keywords_found.values())
has_enough_docs = len(docs) >= test.expected_k
passed = all_found and has_enough_docs
results["passed" if passed else "failed"] += 1
results["details"].append({
"query": test.query,
"passed": passed,
"chunks_retrieved": len(docs),
"keywords_found": keywords_found,
})
return results
report = run_retrieval_tests(retriever, RETRIEVAL_TESTS)
for detail in report["details"]:
status = "PASS" if detail["passed"] else "FAIL"
print(f"[{status}] {detail['query']}")
if not detail["passed"]:
missing = [k for k, v in detail["keywords_found"].items() if not v]
print(f" Missing keywords: {missing}")
print(f" Chunks retrieved: {detail['chunks_retrieved']}")
print(f"\nResults: {report['passed']}/{len(RETRIEVAL_TESTS)} tests passed")
# Output:
# [PASS] What is the refund policy?
# [PASS] How do I reset my password?
# [FAIL] What payment methods are accepted?
# Missing keywords: ['paypal']
# Chunks retrieved: 5
# Results: 2/3 tests passed
If a test fails, debug in this order: (1) verify the keyword exists in your documents, (2) increase k, (3) try a different query phrasing, (4) check chunk boundaries aren't splitting the keyword away from context.
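For debug step (1), a small helper can confirm the keyword exists in the indexed chunks at all before you blame the retriever. The `Chunk` class and mini-corpus below are stand-ins for the real loaded documents:

```python
# Quick helper for debug step (1): does the keyword exist anywhere in the
# indexed chunks? If not, no retriever setting will surface it.

def find_keyword(chunks, keyword: str) -> list[int]:
    """Return indices of chunks whose text contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return [i for i, c in enumerate(chunks) if kw in c.page_content.lower()]

# Hypothetical mini-corpus standing in for the real document chunks
class Chunk:
    def __init__(self, text: str):
        self.page_content = text

corpus = [
    Chunk("We accept credit cards and PayPal."),
    Chunk("Refunds are available within 30 days."),
]
print(find_keyword(corpus, "paypal"))   # → [0]
print(find_keyword(corpus, "bitcoin"))  # → []
```

An empty result means the failure is an indexing gap, not a retrieval tuning problem, and steps (2)-(4) can be skipped.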
Step 7: Evaluation with RAGAS
RAGAS (Retrieval Augmented Generation Assessment) measures four dimensions that matter in production. Unlike manual testing, RAGAS uses an LLM-as-judge approach to score at scale.
| Metric | What it measures | Production target |
|---|---|---|
| Faithfulness | Answer is grounded in retrieved context (no hallucination) | > 0.85 |
| Answer Relevancy | Answer actually addresses the question asked | > 0.80 |
| Context Precision | Retrieved chunks are relevant (no noisy chunks) | > 0.75 |
| Context Recall | All relevant information was retrieved (none missed) | > 0.70 |
Building a Test Dataset
# evaluation_dataset.py
from datasets import Dataset
# Build Q&A pairs from your documents
# Ground truth answers come from the source documents
evaluation_data = {
"question": [
"What is the refund policy for digital products?",
"How long does shipping take to Europe?",
"Can I use the product commercially?",
"What languages is customer support available in?",
"Is there a free trial period?",
],
"ground_truth": [
"Digital products are non-refundable except in cases of technical issues verified by our support team.",
"Standard shipping to Europe takes 7-14 business days. Express shipping takes 3-5 business days.",
"Yes, commercial use is permitted under the Professional and Enterprise license tiers.",
"Customer support is available in English, French, Spanish, and German.",
"Yes, all plans include a 14-day free trial with full feature access and no credit card required.",
],
    # "contexts" and "answer" are added later via add_column;
    # Dataset.from_dict requires equal-length columns, so they
    # cannot start here as empty lists
}
eval_dataset = Dataset.from_dict(evaluation_data)
print(f"Evaluation dataset: {len(eval_dataset)} questions")
Running RAGAS Evaluation
# run_evaluation.py
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Generate answers and collect contexts for each question
def prepare_eval_dataset(dataset, retriever, rag_chain):
"""Run RAG over test questions to populate contexts and answers."""
contexts_list = []
answers_list = []
for question in dataset["question"]:
# Retrieve context
docs = retriever.invoke(question)
contexts = [doc.page_content for doc in docs]
contexts_list.append(contexts)
# Generate answer
answer = rag_chain.invoke(question)
answers_list.append(answer)
dataset = dataset.add_column("contexts", contexts_list)
dataset = dataset.add_column("answer", answers_list)
return dataset
# Prepare dataset with generated answers; .pick("answer") turns the parallel
# chain's dict output into just the answer string
eval_ready = prepare_eval_dataset(eval_dataset, retriever, rag_chain_with_sources.pick("answer"))
# Run RAGAS evaluation
ragas_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
ragas_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
results = evaluate(
dataset=eval_ready,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
llm=ragas_llm,
embeddings=ragas_embeddings,
)
print("\n=== RAGAS Evaluation Results ===")
print(f"Faithfulness: {results['faithfulness']:.3f} (target: >0.85)")
print(f"Answer Relevancy: {results['answer_relevancy']:.3f} (target: >0.80)")
print(f"Context Precision: {results['context_precision']:.3f} (target: >0.75)")
print(f"Context Recall: {results['context_recall']:.3f} (target: >0.70)")
# Typical output for a well-tuned system:
# === RAGAS Evaluation Results ===
# Faithfulness: 0.912 (target: >0.85)
# Answer Relevancy: 0.847 (target: >0.80)
# Context Precision: 0.783 (target: >0.75)
# Context Recall: 0.741 (target: >0.70)
Diagnosing and Improving Low Scores
# diagnosis.py
def diagnose_ragas_failures(results, threshold=0.75):
"""Print actionable remediation for each failing metric."""
metrics = {
"faithfulness": {
"score": results["faithfulness"],
"fixes": [
"Tighten the system prompt: 'Answer ONLY using the context. Never add information.'",
"Reduce temperature to 0 for deterministic, grounded answers",
"Add a post-processing step to verify every claim appears in context",
],
},
"answer_relevancy": {
"score": results["answer_relevancy"],
"fixes": [
"Improve query rewriting — add a step to rephrase ambiguous questions",
"Adjust prompt to require answering the specific question asked",
"Check if low-relevancy answers are caused by off-topic retrieved chunks",
],
},
"context_precision": {
"score": results["context_precision"],
"fixes": [
"Reduce k (retrieved chunks) — fewer but better chunks improve precision",
"Add metadata filters to narrow search scope",
"Try MMR search_type to reduce duplicate/noisy chunks",
"Switch to hybrid search (BM25 + semantic) for keyword-heavy queries",
],
},
"context_recall": {
"score": results["context_recall"],
"fixes": [
"Increase k to retrieve more candidate chunks",
"Improve chunking — large chunks may split relevant content",
"Use semantic chunking to preserve topic boundaries",
"Add query expansion (generate multiple phrasings of the query)",
],
},
}
print("\n=== Diagnosis Report ===")
for metric, data in metrics.items():
if data["score"] < threshold:
print(f"\nFAIL: {metric} = {data['score']:.3f}")
print("Recommended fixes:")
for fix in data["fixes"]:
print(f" • {fix}")
diagnose_ragas_failures(results)
Step 8: Deployment
Option A: Docker Compose (Local / VPS)
Package the RAG API as a FastAPI service alongside ChromaDB. This runs identically in development, on a VPS, or in a container orchestrator.
# app/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import os
app = FastAPI(title="RAG API", version="1.0.0")
class QueryRequest(BaseModel):
question: str
k: int = 5
class QueryResponse(BaseModel):
answer: str
sources: list[dict]
latency_ms: float
# Initialize components on startup
@app.on_event("startup")
async def startup():
global retriever, chain
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    # langchain_chroma's Chroma takes a chromadb client, not host/port kwargs
    import chromadb
    chroma_client = chromadb.HttpClient(
        host=os.getenv("CHROMA_HOST", "chroma"),  # Docker service name
        port=int(os.getenv("CHROMA_PORT", "8000")),
    )
    vectorstore = Chroma(
        client=chroma_client,
        collection_name="documents",
        embedding_function=embeddings,
    )
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "Answer using ONLY the context below.\n\nContext:\n{context}"),
("human", "{question}"),
])
chain = (
{"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
"question": RunnablePassthrough()}
| prompt | llm | StrOutputParser()
)
@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
import time
start = time.time()
try:
docs = retriever.invoke(request.question)
answer = chain.invoke(request.question)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
return QueryResponse(
answer=answer,
sources=[
{"source": d.metadata.get("source", ""), "page": d.metadata.get("page")}
for d in docs
],
latency_ms=(time.time() - start) * 1000,
)
@app.get("/health")
async def health():
return {"status": "ok"}
# docker-compose.yml
version: "3.9"
services:
chroma:
image: chromadb/chroma:latest
ports:
- "8000:8000"
volumes:
- chroma_data:/chroma/chroma
    # To enable token auth, set both CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER
    # and CHROMA_SERVER_AUTH_CREDENTIALS (omitted here for brevity)
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/heartbeat"]
interval: 10s
timeout: 5s
retries: 3
rag_api:
build: .
ports:
- "8080:8080"
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- CHROMA_HOST=chroma
- CHROMA_PORT=8000
depends_on:
chroma:
condition: service_healthy
command: uvicorn app.main:app --host 0.0.0.0 --port 8080
volumes:
chroma_data:
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ ./app/
# Start API
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
# Build and run
docker-compose up --build
# Test
curl -X POST http://localhost:8080/query \
-H "Content-Type: application/json" \
-d '{"question": "What is the refund policy?"}'
# Response:
# {
# "answer": "Refunds are available within 30 days of purchase...",
# "sources": [{"source": "terms.pdf", "page": 4}],
# "latency_ms": 1247.3
# }
Option B: AWS Lambda (Serverless)
For serverless deployment, use Mangum to adapt FastAPI to Lambda's event format, and Pinecone (or ChromaDB on EFS) as the vector store. Lambda removes server management at the cost of cold starts.
# lambda_handler.py
from mangum import Mangum
from app.main import app # FastAPI app from above
# Mangum wraps FastAPI for Lambda + API Gateway
handler = Mangum(app, lifespan="off")
# Deployment steps:
# 1. Package dependencies into a Lambda layer or container image
# 2. Set Lambda environment variables: OPENAI_API_KEY, PINECONE_API_KEY
# 3. Connect to Pinecone instead of ChromaDB (ChromaDB on Lambda is complex)
# 4. Set memory to 1024MB minimum (vector operations need RAM)
# 5. Set timeout to 30 seconds (retrieval + LLM generation can be slow)
# serverless.yml (Serverless Framework)
service: rag-api
provider:
name: aws
runtime: python3.11
region: eu-west-1
memorySize: 1024 # MB — vector operations need RAM
timeout: 30 # Seconds — allow for cold start + LLM generation
environment:
OPENAI_API_KEY: ${env:OPENAI_API_KEY}
PINECONE_API_KEY: ${env:PINECONE_API_KEY}
VECTOR_STORE: pinecone # Use Pinecone for serverless (no persistent FS)
functions:
api:
handler: lambda_handler.handler
events:
- httpApi:
path: /{proxy+}
method: ANY
layers:
- ${cf:rag-dependencies-layer.LambdaLayerArn}
# Deploy:
# npm install -g serverless
# serverless deploy --stage prod
Lambda cost estimate: 1,000 daily RAG queries at ~3s average billed duration × 1024MB = ~$2/month for compute (the 30s timeout is a ceiling, not the typical duration). Pinecone adds ~$2/month for a small index. Total: ~$4/month for a production-ready serverless RAG API serving 30K queries/month.
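That compute figure can be sanity-checked against AWS's published on-demand rate of roughly $0.0000166667 per GB-second, assuming ~3 seconds of average billed duration per query (not the 30-second timeout ceiling). Verify current pricing before relying on this:

```python
# Rough Lambda compute cost check. Rate is AWS's on-demand price of
# ~$0.0000166667 per GB-second at time of writing; confirm before use.
GB_SECOND_PRICE = 0.0000166667

def lambda_monthly_cost(queries_per_day: int, avg_seconds: float, memory_mb: int) -> float:
    """Monthly Lambda compute cost in USD (30-day month)."""
    gb_seconds = queries_per_day * 30 * avg_seconds * (memory_mb / 1024)
    return round(gb_seconds * GB_SECOND_PRICE, 2)

print(lambda_monthly_cost(1_000, 3, 1024))  # → 1.5
```

Request charges ($0.20 per 1M invocations) add only pennies at this volume, so GB-seconds dominate the bill.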
Performance Optimization Checklist
- Cache embeddings: Use `CacheBackedEmbeddings` with Redis to avoid re-embedding identical queries — saves 60-80% on embedding API costs for production traffic
- Async retrieval: Use `retriever.ainvoke()` and `llm.ainvoke()` for non-blocking I/O in FastAPI — supports 3-5x more concurrent requests on the same hardware
- Batch indexing: When indexing >10K documents, use `vectorstore.add_documents()` in batches of 100 to avoid rate limits
- Reduce k first: Lower k (retrieved chunks) before any other optimization — going from k=10 to k=4 halves prompt tokens and typically improves precision
- Use smaller generation models: gpt-4o-mini costs 30x less than gpt-4o with 85-90% of the answer quality for factual retrieval tasks
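To show the idea behind the embedding-cache item without pulling in Redis, here is a stdlib sketch that memoizes vectors by text hash. `embed_fn` is a stand-in for a real embeddings client, not a LangChain API:

```python
# The caching idea behind CacheBackedEmbeddings, with a plain dict in
# place of Redis: hash the text, reuse the stored vector on repeats.
import hashlib

class CachedEmbedder:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache: dict[str, list[float]] = {}
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.cache:
            self.misses += 1  # only a cache miss pays for an API call
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

fake_embed = lambda text: [float(len(text))]  # stand-in embedding model
embedder = CachedEmbedder(fake_embed)
embedder.embed("refund policy")
embedder.embed("refund policy")  # served from cache, no second "API call"
print(embedder.misses)  # → 1
```

In production the dict becomes a shared store (Redis) so the cache survives restarts and is shared across workers, which is what `CacheBackedEmbeddings` provides.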
Next Steps
- Hybrid search: Combine BM25 keyword search with semantic search using `EnsembleRetriever` — improves precision on exact-match queries by 20-30%
- Reranking: Add a Cohere or cross-encoder reranker after retrieval to re-score chunks — consistently improves answer quality at ~$0.001 per query extra cost
- Multi-modal RAG: Extend to images and tables using GPT-4o vision or Unstructured's table extraction
- Agentic RAG: Use LangGraph to build a retrieval agent that decides when to search, what to search for, and when it has enough context
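As background for the hybrid-search item: `EnsembleRetriever` merges the ranked lists from its sub-retrievers with weighted Reciprocal Rank Fusion. A stdlib sketch of unweighted RRF, with invented document IDs:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so items ranked well by multiple retrievers float to the top.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists; k dampens the weight of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_refunds", "doc_shipping", "doc_faq"]      # keyword results
vector_hits = ["doc_faq", "doc_refunds", "doc_pricing"]     # semantic results
print(rrf_merge([bm25_hits, vector_hits]))
```

`doc_refunds` and `doc_faq` appear in both lists, so they outrank documents that only one retriever found; k=60 is the conventional default from the original RRF paper.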
Frequently Asked Questions
What is the difference between ChromaDB and Pinecone for RAG?
ChromaDB is a free, open-source vector database that runs locally (or in Docker). It's ideal for development, small-to-medium datasets (<10M vectors), and privacy-sensitive deployments. Pinecone is a managed cloud service with automatic scaling, serverless billing (~$0.096 per 1M reads), and built-in replication — best for production systems with millions of documents or teams without infrastructure expertise. You can build with ChromaDB and migrate to Pinecone later without changing your LangChain retriever code.
What RAGAS scores should I target before going to production?
Industry benchmarks for production RAG systems: Faithfulness > 0.85 (LLM's answer is grounded in retrieved context), Answer Relevancy > 0.80 (response addresses the question), Context Precision > 0.75 (retrieved chunks are relevant), Context Recall > 0.70 (enough relevant context is retrieved). If any score is below threshold, diagnose the specific failure: low context recall → increase k or improve embeddings; low faithfulness → improve the system prompt to reduce hallucination; low answer relevancy → refine query rewriting.
How do I choose chunk size? Is there a formula?
No universal formula, but a practical approach: decide how many tokens each retrieved chunk should contribute to the prompt, then convert that budget to characters (roughly 4 characters per token for English). For most retrieval tasks start with 1000 characters / 200 overlap and benchmark. If your queries are short (< 5 words), smaller chunks (500 chars) retrieve more precisely. If queries are complex multi-sentence questions, larger chunks (1500-2000 chars) preserve reasoning context. Semantic chunking (splitting on topic boundaries rather than character counts) can outperform fixed-size splitting by 15-25% on context recall — worth the extra implementation time.
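The token-to-character conversion in that heuristic uses the common rule of thumb of about 4 characters per token for English text (an approximation; exact counts depend on the tokenizer). A trivial helper makes it explicit:

```python
# Convert a per-chunk token budget to a character-based chunk_size,
# using the rough 4-chars-per-token rule for English (approximate).

def chars_for_token_budget(tokens: int, chars_per_token: float = 4.0) -> int:
    return int(tokens * chars_per_token)

# Allocating ~250 tokens per chunk gives ~1000-char chunks, matching
# the chunk_size=1000 default used in Step 3
print(chars_for_token_budget(250))  # → 1000
```

For exact counts, tokenize with tiktoken (already in the install list) instead of estimating.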
What are the cold start costs for AWS Lambda with a vector database?
Lambda cold starts for a Python RAG function add 800ms-2s depending on package size. Mitigation: use Lambda layers for heavy dependencies (LangChain, numpy), keep function package under 50MB, and set provisioned concurrency (1-2 instances, ~$15/month) for latency-critical paths. The vector database call (ChromaDB EFS or Pinecone) adds 50-300ms per query. Total P95 latency target: < 3 seconds for the full RAG cycle (embed query → retrieve → generate).
Can I run the full RAG pipeline locally without any API costs?
Yes. Use Ollama for local LLM inference (llama3.2 or mistral) and local embeddings (nomic-embed-text), plus ChromaDB as the vector store. All free. Run `ollama pull llama3.2` and `ollama pull nomic-embed-text`, then replace the OpenAI clients with OllamaEmbeddings and ChatOllama in LangChain. On an M2 MacBook Pro, expect roughly 15-20 tokens/sec from llama3.2 3B; larger models like llama3.1 8B run slower, and 70B-class models drop to a few tokens/sec. For the Docker deployment in this tutorial, add an Ollama service to the compose file and point your RAG service to it.