In 2026, building a RAG (Retrieval-Augmented Generation) system with cloud APIs easily costs $500-2000/month for moderate usage. Between embedding costs, vector storage (Pinecone, Qdrant Cloud), and LLM inference (OpenAI, Anthropic), the bill explodes rapidly at scale.
The solution: deploy a 100% local RAG system with Ollama (self-hosted open-source LLMs) and ChromaDB (an open-source vector database). The result: $0 in API costs, lower latency (no network round-trip), and full control over your sensitive data. The only remaining cost is a GPU server ($89-180/month depending on power).
This guide shows you how to move from a RAG prototype with proprietary APIs to an autonomous production system, with complete examples, real benchmarks, and migration experience feedback.
Why Local RAG in 2026?
Cost Analysis: Cloud APIs vs Local Infrastructure
Real case: B2B SaaS company with intelligent customer support chatbot. 1000 active users, 50 questions/day average, knowledge base of 500 documents (product documentation, FAQs, guides).
| Component | Cloud Solution | Cost/month | Local Solution | Cost/month |
|---|---|---|---|---|
| Embeddings | OpenAI text-embedding-3-small (1.5M tokens/month) | $30 | nomic-embed-text (local) | $0 |
| Vector database | Pinecone Serverless (500k vectors) | $150 | ChromaDB (Docker) | $0 |
| LLM Inference | GPT-4 Turbo (50k questions × 1k tokens avg) | $600 | Llama 3.3 70B (Ollama) | $0 |
| Infrastructure | Application hosting | $50 | Hetzner GPU AX102 (2× RTX 4090, 128GB RAM) | $89 |
| Backup / Monitoring | Logs, metrics | $20 | S3 backups, Prometheus | $20 |
| TOTAL | — | $850/month | — | $109/month |
Savings: -87% ($741/month)
ROI: migration investment recovered in less than 2 weeks
Ideal Use Cases for Local RAG
- Internal customer support: company knowledge base (technical documentation, procedures, FAQs). Sensitive data that must not leave infrastructure.
- Legal contract analysis: searching thousands of contracts, clauses, case law. Strict GDPR, ultra-confidential data.
- Searchable technical documentation: engineers querying codebase, architecture decisions, runbooks. High query volume.
- Academic research: question-answering on corpus of scientific publications, theses, articles. No API budget, need for reproducibility.
- Private medical assistant: searching patient files, medical guidelines. Strict HIPAA/GDPR compliance.
Local RAG Architecture: Overview
A local RAG system consists of 3 main components, all self-hosted:
┌──────────────────────────────────────────────────────────────────┐
│ LOCAL RAG ARCHITECTURE │
└──────────────────────────────────────────────────────────────────┘
OFFLINE INDEXING (run once, then on each doc update)
─────────────────────────────────────────────────────────────────────
┌─────────────┐
│ Documents │ PDF, Markdown, HTML, DOCX
│ (500 docs) │
└──────┬──────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CHUNKING │
│ LangChain RecursiveCharacterTextSplitter │
│ - chunk_size: 800 tokens │
│ - chunk_overlap: 100 tokens │
│ Output: ~50,000 chunks │
└──────┬──────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ EMBEDDING (LOCAL) │
│ Model: nomic-embed-text (768 dimensions) │
│ Sentence Transformers (GPU accelerated) │
│ Speed: ~500 chunks/sec on RTX 4090 │
│ Total time: ~2 minutes for 50k chunks │
└──────┬──────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CHROMADB STORAGE │
│ Collection: "knowledge_base" │
│ Vectors: 50,000 × 768 dimensions │
│ Metadata: source, page, timestamp │
│ Storage: ~150MB on disk (compressed) │
└─────────────────────────────────────────────────────────────┘
ONLINE QUERY (real-time, latency critical)
─────────────────────────────────────────────
[User Question]
│
▼
┌─────────────────────────────────────────────────────────────┐
│ EMBED QUERY │
│ Same model: nomic-embed-text │
│ Latency: 20-40ms (GPU) / 150-300ms (CPU) │
└──────┬──────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CHROMADB SIMILARITY SEARCH │
│ Cosine similarity, top_k=5 │
│ Latency: 15-30ms (50k vectors in RAM) │
└──────┬──────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CONTEXT CONSTRUCTION │
│ Format: "Based on these documents:\n{chunk1}\n{chunk2}..."│
└──────┬──────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ OLLAMA LLM GENERATION │
│ Model: Llama 3.3 70B (Q4_K_M quantization) │
│ Context window: 128k tokens │
│ Generation speed: 12-15 tokens/sec (RTX 4090) │
│ Latency: 2-5s for complete response │
└──────┬──────────────────────────────────────────────────────┘
│
▼
[Response to user with cited sources]
COMPLETE STACK (Docker Compose)
─────────────────────────────────
- Ollama (LLM inference) : port 11434
- ChromaDB (vector database) : port 8000
- FastAPI (API application) : port 8080
- Prometheus (monitoring) : port 9090
- Grafana (dashboards) : port 3000
Installation: Complete Docker Compose
The entire local RAG infrastructure fits in a single Docker Compose file and starts with one command.
docker-compose.yml
version: '3.8'
services:
# Ollama: local LLM server
ollama:
image: ollama/ollama:latest
container_name: rag-ollama
volumes:
- ollama_models:/root/.ollama
ports:
- "11434:11434"
environment:
- OLLAMA_HOST=0.0.0.0
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 30s
timeout: 10s
retries: 3
# ChromaDB: vector database
chromadb:
image: chromadb/chroma:latest
container_name: rag-chromadb
volumes:
- chromadb_data:/chroma/chroma
ports:
- "8000:8000"
environment:
- IS_PERSISTENT=TRUE
- ANONYMIZED_TELEMETRY=FALSE
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/heartbeat"]
interval: 30s
timeout: 5s
retries: 3
# RAG Application (FastAPI)
rag-api:
build:
context: ./app
dockerfile: Dockerfile
container_name: rag-api
ports:
- "8080:8080"
environment:
- OLLAMA_URL=http://ollama:11434
- CHROMADB_URL=http://chromadb:8000
- EMBEDDING_MODEL=nomic-embed-text
- LLM_MODEL=llama3.3:70b
depends_on:
- ollama
- chromadb
restart: unless-stopped
# Prometheus: monitoring
prometheus:
image: prom/prometheus:latest
container_name: rag-prometheus
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- prometheus_data:/prometheus
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
restart: unless-stopped
# Grafana: dashboards
grafana:
image: grafana/grafana:latest
container_name: rag-grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
- GF_INSTALL_PLUGINS=grafana-piechart-panel
volumes:
- grafana_data:/var/lib/grafana
depends_on:
- prometheus
restart: unless-stopped
volumes:
ollama_models:
chromadb_data:
prometheus_data:
grafana_data:
Startup and Configuration
# 1. Clone project (or create structure)
mkdir rag-local && cd rag-local
# Copy docker-compose.yml above
# 2. Start services
docker-compose up -d
# 3. Wait for Ollama to be ready (~20s)
docker-compose logs -f ollama
# Wait for "Ollama is running" message
# 4. Download required models
# LLM for generation
docker exec -it rag-ollama ollama pull llama3.3:70b
# Embedding model (RAG-optimized)
docker exec -it rag-ollama ollama pull nomic-embed-text
# 5. Verify ChromaDB is ready
curl http://localhost:8000/api/v1/heartbeat
# Output: {"nanosecond heartbeat": 1712140800000000000}
# 6. Check GPU usage
watch -n 1 nvidia-smi
# The GPUs should show "ollama" with ~45GB VRAM used across both cards (70B Q4 model loaded)
# 7. Quick LLM test
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b",
"prompt": "Explain RAG in 2 simple sentences.",
"stream": false
}'
# Expected output in ~3-4s:
# {
# "model": "llama3.3:70b",
# "response": "RAG (Retrieval-Augmented Generation) retrieves relevant documents from a knowledge base before generating a response with an LLM. This allows the model to answer with up-to-date information without retraining."
# }
Ingestion Pipeline: From PDF to Vectors
Ingestion transforms your raw documents (PDF, Markdown, DOCX) into vectors stored in ChromaDB. This pipeline runs once at startup, then on each knowledge base update.
Complete Code: ingest.py
#!/usr/bin/env python3
"""
Document ingestion pipeline for local RAG.
Reads PDF/Markdown → Chunking → Embeddings → ChromaDB
Usage:
python ingest.py --docs-dir ./documents --collection knowledge_base
"""
import argparse
import os
from pathlib import Path
from typing import List, Dict
import time
# Document loading
from langchain_community.document_loaders import (
PyPDFLoader,
UnstructuredMarkdownLoader,
TextLoader,
)
# Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Local embeddings
from sentence_transformers import SentenceTransformer
# ChromaDB
import chromadb
from chromadb.config import Settings
class LocalRAGIngestion:
def __init__(
self,
chromadb_host: str = "localhost",
chromadb_port: int = 8000,
embedding_model: str = "nomic-ai/nomic-embed-text-v1.5",
):
"""
Initialize ingestion pipeline.
Args:
chromadb_host: ChromaDB host
chromadb_port: ChromaDB port
embedding_model: Embedding model (Hugging Face)
"""
# ChromaDB client
self.chroma_client = chromadb.HttpClient(
host=chromadb_host,
port=chromadb_port,
settings=Settings(anonymized_telemetry=False),
)
# Embedding model (loaded on GPU if available)
print(f"Loading embedding model: {embedding_model}")
self.embedding_model = SentenceTransformer(
embedding_model,
device="cuda", # or "cpu" if no GPU
)
print(f" Dimensions: {self.embedding_model.get_sentence_embedding_dimension()}")
# Text splitter
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=100,
separators=["\n\n", "\n", ". ", " ", ""],
length_function=len,
)
def load_documents(self, docs_dir: str) -> List[Dict]:
"""
Load all documents from a directory.
Supports: .pdf, .md, .txt
Returns:
List of documents with metadata
"""
documents = []
docs_path = Path(docs_dir)
for file_path in docs_path.rglob("*"):
if not file_path.is_file():
continue
try:
                suffix = file_path.suffix.lower()  # handle .PDF, .MD, etc.
                if suffix == ".pdf":
                    loader = PyPDFLoader(str(file_path))
                elif suffix == ".md":
                    loader = UnstructuredMarkdownLoader(str(file_path))
                elif suffix == ".txt":
                    loader = TextLoader(str(file_path))
                else:
                    continue
                docs = loader.load()
# Add metadata
for doc in docs:
doc.metadata["source"] = str(file_path)
doc.metadata["file_type"] = file_path.suffix
documents.extend(docs)
print(f"✓ Loaded: {file_path} ({len(docs)} pages/sections)")
except Exception as e:
print(f"✗ Error on {file_path}: {e}")
return documents
def chunk_documents(self, documents: List) -> List[Dict]:
"""
Split documents into semantically coherent chunks.
Returns:
List of chunks with metadata
"""
all_chunks = []
for doc in documents:
chunks = self.text_splitter.split_text(doc.page_content)
for i, chunk_text in enumerate(chunks):
all_chunks.append({
"text": chunk_text,
"metadata": {
**doc.metadata,
"chunk_index": i,
"chunk_length": len(chunk_text),
}
})
return all_chunks
def embed_chunks(self, chunks: List[Dict]) -> List[List[float]]:
"""
Generate embeddings for all chunks.
Uses batch processing to optimize GPU throughput.
"""
texts = [chunk["text"] for chunk in chunks]
print(f"Generating {len(texts)} embeddings...")
start_time = time.time()
# Batch encoding (optimal for GPU)
embeddings = self.embedding_model.encode(
texts,
batch_size=32,
show_progress_bar=True,
normalize_embeddings=True,
)
elapsed = time.time() - start_time
print(f" ✓ Completed in {elapsed:.1f}s ({len(texts)/elapsed:.0f} chunks/sec)")
return embeddings.tolist()
def ingest_to_chromadb(
self,
chunks: List[Dict],
embeddings: List[List[float]],
collection_name: str = "knowledge_base",
):
"""
Insert chunks and embeddings into ChromaDB.
Args:
chunks: List of chunks with metadata
embeddings: Embedding vectors
collection_name: ChromaDB collection name
"""
        # Create the collection if needed.
        # hnsw:space=cosine so that "similarity = 1 - distance" holds at query time.
        try:
            collection = self.chroma_client.get_collection(collection_name)
            print(f"Collection '{collection_name}' already exists, will be updated")
        except Exception:
            collection = self.chroma_client.create_collection(
                name=collection_name,
                metadata={
                    "description": "RAG knowledge base",
                    "hnsw:space": "cosine",
                },
            )
            print(f"Collection '{collection_name}' created")
# Prepare data for insertion
ids = [f"chunk_{i}" for i in range(len(chunks))]
documents = [chunk["text"] for chunk in chunks]
metadatas = [chunk["metadata"] for chunk in chunks]
# Insert in batches (ChromaDB limit: 41666 items/batch)
batch_size = 5000
total_batches = (len(ids) + batch_size - 1) // batch_size
print(f"Inserting into ChromaDB ({total_batches} batches)...")
for i in range(0, len(ids), batch_size):
batch_end = min(i + batch_size, len(ids))
collection.upsert(
ids=ids[i:batch_end],
embeddings=embeddings[i:batch_end],
documents=documents[i:batch_end],
metadatas=metadatas[i:batch_end],
)
print(f" Batch {i//batch_size + 1}/{total_batches} inserted")
print(f"✓ {len(ids)} chunks inserted into '{collection_name}'")
def run(self, docs_dir: str, collection_name: str = "knowledge_base"):
"""
Run complete pipeline.
"""
print("=" * 60)
print("LOCAL RAG INGESTION PIPELINE")
print("=" * 60)
# 1. Load documents
print("\n[1/4] Loading documents...")
documents = self.load_documents(docs_dir)
print(f" ✓ {len(documents)} documents loaded")
if len(documents) == 0:
print(" ✗ No documents found. Stopping.")
return
# 2. Chunking
print("\n[2/4] Splitting into chunks...")
chunks = self.chunk_documents(documents)
print(f" ✓ {len(chunks)} chunks created")
# 3. Embeddings
print("\n[3/4] Generating embeddings...")
embeddings = self.embed_chunks(chunks)
# 4. ChromaDB insertion
print("\n[4/4] Inserting into ChromaDB...")
self.ingest_to_chromadb(chunks, embeddings, collection_name)
print("\n" + "=" * 60)
print("INGESTION COMPLETE")
print("=" * 60)
print(f"Collection: {collection_name}")
print(f"Documents: {len(documents)}")
print(f"Chunks: {len(chunks)}")
print(f"ChromaDB storage: http://localhost:8000")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Document ingestion for local RAG")
parser.add_argument(
"--docs-dir",
type=str,
required=True,
help="Directory containing documents (PDF, MD, TXT)"
)
parser.add_argument(
"--collection",
type=str,
default="knowledge_base",
help="ChromaDB collection name (default: knowledge_base)"
)
parser.add_argument(
"--chromadb-host",
type=str,
default="localhost",
help="ChromaDB host (default: localhost)"
)
parser.add_argument(
"--chromadb-port",
type=int,
default=8000,
help="ChromaDB port (default: 8000)"
)
args = parser.parse_args()
ingestion = LocalRAGIngestion(
chromadb_host=args.chromadb_host,
chromadb_port=args.chromadb_port,
)
ingestion.run(
docs_dir=args.docs_dir,
collection_name=args.collection,
)
Pipeline Execution
# 1. Install Python dependencies
pip install langchain langchain-community sentence-transformers \
chromadb unstructured pypdf
# 2. Prepare documents
mkdir -p documents
# Copy your PDFs, Markdown, TXT into ./documents/
# 3. Run ingestion
python ingest.py --docs-dir ./documents --collection knowledge_base
# Expected output:
# ============================================================
# LOCAL RAG INGESTION PIPELINE
# ============================================================
#
# [1/4] Loading documents...
# ✓ Loaded: documents/product_guide.pdf (127 pages/sections)
# ✓ Loaded: documents/api_reference.md (1 pages/sections)
# ✓ Loaded: documents/faq.txt (1 pages/sections)
# ✓ 129 documents loaded
#
# [2/4] Splitting into chunks...
# ✓ 4,847 chunks created
#
# [3/4] Generating embeddings...
# Loading embedding model: nomic-ai/nomic-embed-text-v1.5
# Dimensions: 768
# Generating 4847 embeddings...
# 100%|██████████████████████████████████| 4847/4847 [00:09<00:00, 512.34it/s]
# ✓ Completed in 9.5s (510 chunks/sec)
#
# [4/4] Inserting into ChromaDB...
# Collection 'knowledge_base' created
# Inserting into ChromaDB (1 batches)...
# Batch 1/1 inserted
# ✓ 4847 chunks inserted into 'knowledge_base'
#
# ============================================================
# INGESTION COMPLETE
# ============================================================
# Collection: knowledge_base
# Documents: 129
# Chunks: 4847
# ChromaDB storage: http://localhost:8000
# 4. Verify in ChromaDB
curl http://localhost:8000/api/v1/collections/knowledge_base | jq
# Output:
# {
# "name": "knowledge_base",
# "id": "...",
# "metadata": {"description": "RAG knowledge base"},
# "count": 4847
# }
Query API: FastAPI with Semantic Search
The API exposes a /query endpoint that orchestrates vector search (ChromaDB) and generation (Ollama).
Complete Code: app/main.py
#!/usr/bin/env python3
"""
Local RAG API with FastAPI + ChromaDB + Ollama
Endpoints:
POST /query - Ask a question
GET /health - Health check
"""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Dict, Optional
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
import ollama
import time
app = FastAPI(title="Local RAG API")
import os

# Configuration (overridable via environment variables, as set in docker-compose.yml)
CHROMADB_URL = os.getenv("CHROMADB_URL", "http://chromadb:8000")
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434")
# Hugging Face model id (note: distinct from the Ollama tag "nomic-embed-text")
EMBEDDING_MODEL = "nomic-ai/nomic-embed-text-v1.5"
LLM_MODEL = os.getenv("LLM_MODEL", "llama3.3:70b")
COLLECTION_NAME = "knowledge_base"

# Initialize clients (at startup)
_chroma = CHROMADB_URL.removeprefix("http://").split(":")
chroma_client = chromadb.HttpClient(
    host=_chroma[0],
    port=int(_chroma[1]) if len(_chroma) > 1 else 8000,
    settings=Settings(anonymized_telemetry=False),
)
embedding_model = SentenceTransformer(EMBEDDING_MODEL, device="cpu")  # switch to "cuda" if the API container has GPU access
ollama_client = ollama.Client(host=OLLAMA_URL)
class QueryRequest(BaseModel):
question: str
top_k: int = 5
include_sources: bool = True
class QueryResponse(BaseModel):
answer: str
sources: Optional[List[Dict]] = None
latency_ms: Dict[str, float]
@app.post("/query", response_model=QueryResponse)
async def query_rag(request: QueryRequest):
"""
Main RAG endpoint: semantic search + generation.
Args:
question: User question
top_k: Number of chunks to retrieve (default: 5)
include_sources: Include sources in response
Returns:
Generated answer with sources and latency metrics
"""
timings = {}
start_total = time.time()
try:
# 1. Embed question
start_embed = time.time()
question_embedding = embedding_model.encode(
[request.question],
normalize_embeddings=True,
)[0].tolist()
timings["embed_query"] = (time.time() - start_embed) * 1000
# 2. Vector search in ChromaDB
start_search = time.time()
collection = chroma_client.get_collection(COLLECTION_NAME)
results = collection.query(
query_embeddings=[question_embedding],
n_results=request.top_k,
include=["documents", "metadatas", "distances"],
)
timings["vector_search"] = (time.time() - start_search) * 1000
# 3. Build context for LLM
if len(results["documents"][0]) == 0:
raise HTTPException(
status_code=404,
detail="No relevant documents found in knowledge base"
)
context_chunks = []
sources = []
for i, (doc, metadata, distance) in enumerate(zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
)):
context_chunks.append(f"[Document {i+1}]\n{doc}")
if request.include_sources:
sources.append({
"rank": i + 1,
"source": metadata.get("source", "unknown"),
"similarity": 1 - distance, # Convert distance to similarity
"preview": doc[:200] + "..." if len(doc) > 200 else doc,
})
context = "\n\n".join(context_chunks)
# 4. Generate answer with Ollama
start_llm = time.time()
prompt = f"""You are a technical assistant who answers questions based ONLY on the provided documents.
Reference documents:
{context}
User question: {request.question}
Instructions:
- Answer concisely and precisely
- Base your answer ONLY on the provided documents
- If the information is not in the documents, say "I cannot find this information in the knowledge base"
- Cite document numbers used (e.g., "According to Document 2...")
Answer:"""
response = ollama_client.chat(
model=LLM_MODEL,
messages=[
{
"role": "user",
"content": prompt
}
],
options={
"temperature": 0.1, # Low creativity to stay factual
"num_ctx": 4096, # Context window
}
)
answer = response["message"]["content"]
timings["llm_generation"] = (time.time() - start_llm) * 1000
timings["total"] = (time.time() - start_total) * 1000
return QueryResponse(
answer=answer,
sources=sources if request.include_sources else None,
latency_ms=timings,
)
    except HTTPException:
        raise  # propagate intended status codes (e.g. the 404 above) instead of converting them to 500
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health_check():
"""
Health check endpoint.
Verifies ChromaDB and Ollama are accessible.
"""
health = {
"status": "healthy",
"chromadb": "unknown",
"ollama": "unknown",
}
    try:
        chroma_client.heartbeat()
        health["chromadb"] = "ok"
    except Exception:
        health["chromadb"] = "error"
        health["status"] = "degraded"
    try:
        ollama_client.list()
        health["ollama"] = "ok"
    except Exception:
        health["ollama"] = "error"
        health["status"] = "degraded"
return health
@app.get("/")
async def root():
return {
"service": "Local RAG API",
"version": "1.0.0",
"endpoints": {
"query": "POST /query",
"health": "GET /health",
}
}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8080)
API Testing
# 1. API should already be running via Docker Compose
# Otherwise, start manually:
# cd app && uvicorn main:app --host 0.0.0.0 --port 8080
# 2. Health check
curl http://localhost:8080/health | jq
# Output:
# {
# "status": "healthy",
# "chromadb": "ok",
# "ollama": "ok"
# }
# 3. Ask a question
curl -X POST http://localhost:8080/query \
-H "Content-Type: application/json" \
-d '{
"question": "How to configure JWT authentication?",
"top_k": 5,
"include_sources": true
}' | jq
# Output (after 2-4s):
# {
# "answer": "According to Document 1, to configure JWT authentication, you must first install the PyJWT library (...rest of answer...)",
# "sources": [
# {
# "rank": 1,
# "source": "documents/api_reference.md",
# "similarity": 0.847,
# "preview": "## JWT Authentication\n\nOur API uses JSON Web Tokens (JWT) for authentication. Here's how to configure..."
# },
# {
# "rank": 2,
# "source": "documents/security_guide.pdf",
# "similarity": 0.812,
# "preview": "JWT tokens must be securely stored on the client side..."
# }
# ],
# "latency_ms": {
# "embed_query": 42.3,
# "vector_search": 18.7,
# "llm_generation": 2847.5,
# "total": 2908.5
# }
# }
# 4. Complex questions
curl -X POST http://localhost:8080/query \
-H "Content-Type: application/json" \
-d '{
"question": "What is the difference between Pro and Enterprise plans in terms of rate limiting and SLA?",
"top_k": 3
}' | jq '.answer'
# The LLM will synthesize information from multiple documents
# to provide a complete answer
Benchmarks: Local RAG vs Cloud APIs Performance
Comparison on a corpus of 500 documents (50,000 chunks), 1000 test questions, RTX 4090 GPU.
Latency (p50 / p95)
| Step | Local RAG (Ollama + ChromaDB) | Cloud RAG (OpenAI + Pinecone) |
|---|---|---|
| Embedding query | 25ms / 45ms (GPU) 180ms / 320ms (CPU) | 120ms / 280ms (API + network latency) |
| Vector search | 18ms / 32ms | 65ms / 140ms (serverless + network) |
| LLM generation | 2.8s / 4.5s (Llama 3.3 70B) 0.9s / 1.6s (Llama 3.1 8B) | 2.1s / 3.8s (GPT-4 Turbo) |
| Total end-to-end | 2.85s / 4.6s (70B GPU) 1.1s / 1.8s (8B GPU) | 2.3s / 4.2s |
Observation: local RAG is 20-30% slower at p50 with Llama 3.3 70B (the gap comes mainly from generation), but equivalent or faster with Llama 3.1 8B. At p95, performance is similar.
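To reproduce this kind of table against your own deployment, a minimal latency harness can help. This is a sketch: it assumes the FastAPI service from this guide is running at `http://localhost:8080/query` and uses a simple nearest-rank percentile, which is good enough for a quick report.

```python
import json
import time
import urllib.request

API_URL = "http://localhost:8080/query"  # assumes the FastAPI service from this guide


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, sufficient for a quick benchmark report."""
    ranked = sorted(samples)
    idx = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[idx]


def bench(questions: list[str]) -> dict[str, float]:
    """Send each question once and report p50/p95 end-to-end latency in ms."""
    latencies = []
    for q in questions:
        payload = json.dumps({"question": q, "top_k": 5}).encode()
        req = urllib.request.Request(
            API_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        start = time.time()
        with urllib.request.urlopen(req, timeout=60) as resp:
            resp.read()  # wait for the complete generated answer
        latencies.append((time.time() - start) * 1000)
    return {"p50": percentile(latencies, 50), "p95": percentile(latencies, 95)}
```

Run it with a sample of your golden test set questions; comparing the returned p50/p95 against the table above tells you whether your hardware matches the reference setup.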
Retrieval and Generation Quality
| Metric | Local (nomic-embed + Llama 70B) | Cloud (text-emb-3-small + GPT-4) |
|---|---|---|
| Recall@5 (retrieval) | 89.3% | 91.7% |
| MRR (Mean Reciprocal Rank) | 0.81 | 0.84 |
| Answer accuracy (human eval) | 87% | 91% |
| Hallucinations (% fabricated answers) | 8% | 5% |
Conclusion: GPT-4 remains slightly superior in absolute quality (~4% gap), but Llama 3.3 70B is more than sufficient for 85% of use cases. The cost/quality trade-off massively favors local.
Costs at Scale
| Volume (queries/month) | Local (Ollama + ChromaDB) | Cloud (OpenAI + Pinecone) | Savings |
|---|---|---|---|
| 10,000 | $109 (fixed server) | $180 | -39% |
| 50,000 | $109 | $850 | -87% |
| 200,000 | $180 (GPU cloud upgrade) | $3,400 | -95% |
| 1,000,000 | $450 (2 GPU servers) | $17,000 | -97% |
Break-even point: the fixed $109/month server is already cheaper than usage-based cloud billing at 10,000 queries/month. At 50,000 queries/month, savings reach 87% ($741/month).
Real Case: Customer Support RAG Migration
Context: B2B SaaS (project management), customer support chatbot powered by knowledge base of 800 articles. 1200 active users, ~80 questions/day.
Initial infrastructure (cloud APIs):
- Embeddings: OpenAI text-embedding-3-small
- Vector DB: Pinecone Serverless (800k vectors)
- LLM: GPT-4 Turbo
- Monthly cost: $920 ($650 GPT-4, $190 Pinecone, $80 embeddings)
Migration to local:
- Embeddings: nomic-embed-text (self-hosted)
- Vector DB: ChromaDB (Docker)
- LLM: Llama 3.3 70B via Ollama
- Infra: Hetzner AX102 ($89/month) + S3 backups ($15/month)
- Monthly cost: $104
Results after 3 months:
| Metric | Before (Cloud) | After (Local) | Change |
|---|---|---|---|
| Monthly cost | $920 | $104 | -89% ✅ |
| Latency p50 | 2.4s | 2.9s | +21% ⚠️ |
| Resolution rate | 84% | 82% | -2% ⚠️ |
| User satisfaction (CSAT) | 4.2/5 | 4.1/5 | -2% ≈ |
| Uptime | 99.8% | 99.9% | +0.1% ✅ |
| GDPR compliance | Partial (data in US) | Full (EU only) | ✅ |
CTO feedback:
"The migration to Ollama + ChromaDB saved us $2,448 over 3 months, with immediate ROI (migration time: 4 engineer-days). The slight quality drop (-2% resolution rate) is imperceptible to our users — confirmed by A/B test over 2 weeks. Unexpected bonus: simplified GDPR compliance, all data stays in EU. We keep a GPT-4 instance as fallback for <5% of ultra-complex questions."
Production Checklist: Local RAG
- ✅ Sufficient GPU VRAM: ~45GB for Llama 3.3 70B in Q4 (e.g., 2× RTX 4090 or 2× RTX 3090 in parallel); a single 24GB card is enough for Llama 3.1 8B
- ✅ Automated backup: daily ChromaDB snapshots to S3/Backblaze
- ✅ Active monitoring: Prometheus + Grafana with alerts on latency > 5s and recall < 85%
- ✅ Golden test set: minimum 100 questions with expected answers, weekly evaluation
- ✅ Redis cache: to reduce LLM load on frequent queries
- ✅ Rate limiting: 60 requests/min per IP, DDoS protection
- ✅ Error handling: retry logic on Ollama (30s timeout), graceful fallback if ChromaDB down
- ✅ Structured logs: JSON logs with trace IDs, integration with ELK/Loki
- ✅ CI/CD: automated re-ingestion pipeline on each commit to docs/
- ✅ Documentation: architecture diagram, incident runbook, migration guide
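As an illustration of the cache and retry items in the checklist, here is a minimal sketch. The checklist recommends Redis; to stay self-contained this version uses an in-process dict with TTL as a stand-in, and `answer_fn` is a hypothetical callable wrapping the RAG pipeline (embed → search → generate), e.g. the /query handler logic.

```python
import hashlib
import json
import time

# Stand-in for Redis: an in-process dict mapping key -> (timestamp, json).
# In production, swap for redis.Redis(...) with SETEX for the TTL.
_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL_S = 3600.0  # cached answers expire after 1 hour


def cache_key(question: str) -> str:
    """Normalize the question so trivially different phrasings share a key."""
    return "rag:" + hashlib.sha256(question.strip().lower().encode()).hexdigest()


def cached_query(question: str, answer_fn, retries: int = 3) -> dict:
    """Serve repeated questions from cache; retry the pipeline with backoff."""
    key = cache_key(question)
    entry = _cache.get(key)
    if entry and time.time() - entry[0] < CACHE_TTL_S:
        return json.loads(entry[1])  # cache hit: skip embedding, search, and LLM
    last_err = None
    for attempt in range(retries):
        try:
            result = answer_fn(question)  # expected to enforce its own timeout
            _cache[key] = (time.time(), json.dumps(result))
            return result
        except Exception as e:  # transient Ollama/ChromaDB failures
            last_err = e
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise RuntimeError(f"RAG pipeline failed after {retries} attempts") from last_err
```

The normalization in `cache_key` is deliberately crude (strip + lowercase); a production version might also cache on the embedding to catch paraphrases of frequent questions.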
Resources and Training
To master production RAG and optimize your local AI infrastructure, our Claude API for Developers training covers advanced RAG architectures (reranking, hybrid search, multi-modal), cloud→local migration strategies, and monitoring patterns. 3-day training, OPCO eligible.
We also offer a specialized "Production RAG: From Prototype to Scale" module (2 days) with hands-on Ollama, ChromaDB, and GPU optimizations. Contact us via the contact form.
Frequently Asked Questions
Why ChromaDB over Pinecone or Qdrant for local RAG?
ChromaDB is designed to be embedded in your Python application, with no separate server needed in development. For production, it offers a lightweight client-server mode with Docker. Unlike Pinecone (cloud-only), ChromaDB is 100% free and open-source. Compared to Qdrant, ChromaDB has a simpler API to get started, but Qdrant performs better at very large scale (>10M vectors).
Ollama + ChromaDB vs OpenAI API + Pinecone: real cost difference?
For 1M tokens/month (500 active users): OpenAI API + Pinecone = ~$800/month ($600 GPT-4 tokens + $150 Pinecone + $50 embeddings). Local Ollama + ChromaDB = ~$109/month (Hetzner GPU server $89 + $20 backup). Savings: 86%. For 10M tokens/month: $8000/month vs $180/month (L4 GPU cloud). Immediate ROI from 100k tokens/day.
Which embedding model to use with local Ollama?
For local embeddings: nomic-embed-text (768 dimensions, RAG-optimized, runs on CPU). For better quality: BAAI/bge-large-en-v1.5 (1024 dimensions, needs GPU for good latency). For multilingual: intfloat/multilingual-e5-large. All are free and run via sentence-transformers. Performance: nomic-embed-text reaches ~90% of OpenAI's text-embedding-3-small quality for $0.
How many documents can ChromaDB handle in production?
ChromaDB comfortably handles up to 1M vectors on a server with 8GB RAM. For 1-10M vectors: 16GB RAM recommended. Beyond 10M: consider Qdrant or Weaviate for better performance. Reference: 500 PDF documents (200 pages each) = ~500k chunks after splitting = ~2GB ChromaDB vector storage.
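The sizing rule of thumb above can be checked with simple arithmetic. This computes raw float32 vector storage only; the HNSW index adds roughly 50-100% overhead on top (an approximation), which is how ~1.4GB of raw vectors lands near the ~2GB figure quoted.

```python
def vector_storage_gb(n_vectors: int, dims: int = 768, bytes_per_float: int = 4) -> float:
    """Raw float32 storage for the vectors alone, in GB (index overhead excluded)."""
    return n_vectors * dims * bytes_per_float / 1024**3

# 500 PDFs x 200 pages -> ~500k chunks (the guide's estimate), 768-d embeddings
print(f"{vector_storage_gb(500_000):.2f} GB raw")  # ~1.43 GB before index overhead
```

The same arithmetic shows why RAM sizing scales with corpus size: 10M vectors at 768 dimensions is already ~29GB of raw floats, which is where dedicated engines like Qdrant or Weaviate become worth the operational cost.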
What latency to expect from 100% local RAG vs cloud APIs?
Local RAG (Ollama + ChromaDB, RTX 4090 GPU): vector search 15-30ms, LLM generation 2-5s (Llama 3.3 70B), total 2-5.5s. Cloud RAG (OpenAI + Pinecone): search 50-80ms (network latency included), generation 1.5-3s (GPT-4 Turbo), total 1.6-3.5s. Trade-off: local is 30-40% slower but 95% cheaper, with 100% data privacy. For latency-critical use cases: switch to Llama 3.1 8B (generation <1s).