Retrieval-Augmented Generation (RAG) has become the de facto standard for building AI applications that leverage proprietary knowledge in 2026. Unlike fine-tuning, RAG allows injecting up-to-date data directly at inference time, without retraining the model.
This guide walks you through taking a RAG prototype to a robust production system: chunking strategy selection, embedding model selection with real latency data, vector database comparison with benchmarks, reranking patterns, and monitoring architecture.
RAG Architecture: Overview
A production RAG pipeline consists of two distinct phases:
Phase 1: Indexing (Offline)
- Ingestion: loading source documents (PDF, Markdown, HTML, databases)
- Chunking: splitting into semantically coherent fragments (200-800 tokens)
- Embedding: transformation into vectors via an embedding model
- Storage: insertion into a vector database with metadata
Phase 2: Retrieval and Generation (Online)
- Query embedding: transformation of user question into vector
- Semantic similarity: search for k closest chunks (ANN search)
- Reranking (optional): reordering results via cross-encoding model
- Generation: send to LLM with retrieved chunks as context
# Simplified end-to-end RAG architecture
┌─────────────────┐
│ Documents │ PDF, Markdown, API data
└────────┬────────┘
│
▼
┌─────────────────┐
│ Chunking │ LangChain RecursiveCharacterTextSplitter
└────────┬────────┘
│
▼
┌─────────────────┐
│ Embedding │ text-embedding-3-small (OpenAI)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Vector DB │ Qdrant / Pinecone / pgvector
└─────────────────┘
[User Query] ──> [Embed] ──> [Similarity Search] ──> [Rerank] ──> [LLM + Context] ──> [Response]
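The online path in the diagram can be condensed into a single function. Here is a minimal sketch where `embed`, `search`, and `generate` are placeholders for the components detailed in the rest of this guide (OpenAI embeddings, Qdrant, an LLM API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Chunk:
    text: str
    score: float  # similarity score returned by the vector DB

def answer(
    query: str,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float], int], List[Chunk]],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> str:
    """Online phase: embed the query, retrieve chunks, build context, generate."""
    query_vector = embed(query)
    candidates = search(query_vector, 20)  # wide net, narrowed to top_k below
    top = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]
    context = "\n\n".join(c.text for c in top)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

In production, the three callables are replaced by real clients; the structure stays the same.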
Chunking Strategies: Comparison
Chunking is the most critical step. Poor splitting destroys retrieval quality, regardless of your embedding model's performance.
1. Fixed-Size Chunking
Split by number of characters or tokens, with optional overlap.
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
chunk_size=512, # 512 tokens ≈ 2000 characters
chunk_overlap=50, # 10% overlap to avoid cutting in the middle of an idea
separator="\n\n" # Split on paragraphs first if possible
)
chunks = splitter.split_text(document_text)
# Example result:
# Chunk 1: tokens 0-512
# Chunk 2: tokens 462-974 (50 token overlap)
# Chunk 3: tokens 924-1436
Advantages: simple, fast, predictable.
Disadvantages: can cut mid-sentence, ignores semantic structure.
Use cases: homogeneous technical documentation, structured logs, FAQs.
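The overlap arithmetic from the example above can be reproduced in a few dependency-free lines (token indices only; a real splitter would work on tokenizer output):

```python
from typing import List, Tuple

def chunk_spans(n_tokens: int, chunk_size: int = 512, overlap: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) token spans for fixed-size chunking with overlap."""
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start = end - overlap  # step back by the overlap before the next chunk
    return spans

print(chunk_spans(1436))  # [(0, 512), (462, 974), (924, 1436)]
```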
2. Recursive Character Text Splitting (LangChain)
Hierarchical splitting that attempts to respect document structure (paragraphs, sentences, words).
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=100,
separators=["\n\n", "\n", ". ", " ", ""], # Priority order
length_function=len,
)
chunks = splitter.split_text(document_text)
# Process:
# 1. Try to cut on \n\n (double newline)
# 2. If chunk too large, cut on \n (single newline)
# 3. If still too large, cut on ". " (end of sentence)
# 4. If still too large, cut on space
# 5. As last resort, cut character by character
Advantages: respects natural text structure, more coherent chunks.
Disadvantages: slightly slower than fixed-size, variable chunk size.
Use cases: blog articles, narrative documentation, contracts, reports.
3. Semantic Chunking
Splitting based on semantic similarity between consecutive sentences. Groups sentences that discuss the same topic.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
splitter = SemanticChunker(
embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # split where sentence-to-sentence embedding distance exceeds the threshold
    breakpoint_threshold_amount=80,          # 80th percentile
)
chunks = splitter.split_text(document_text)
# Internal process:
# 1. Split text into sentences
# 2. Calculate embedding for each sentence
# 3. Measure cosine similarity between consecutive sentences
# 4. Cut when similarity drops below threshold
# 5. Result: variable-size but semantically coherent chunks
Advantages: semantically coherent chunks, better retrieval quality.
Disadvantages: slow (requires embedding each sentence), expensive, variable size.
Use cases: complex knowledge bases, books, academic research.
Chunking Benchmark: Recall@5 on 1000 Questions
| Strategy | Recall@5 | Indexing Latency | Cost (1M tokens) |
|---|---|---|---|
| Fixed-Size (512 tokens) | 76% | 2.3s | $0.13 |
| Recursive (800 tokens) | 84% | 2.8s | $0.13 |
| Semantic Chunking | 91% | 47s | $2.40 |
Recommendation: use RecursiveCharacterTextSplitter for 90% of cases. Only switch to semantic chunking if recall is your bottleneck and you have the budget.
Embedding Models: 2026 Comparison
The choice of embedding model directly impacts retrieval quality, latency, and cost. Here are the reference models in 2026 with real benchmarks.
Comparison Table
| Model | Dimensions | MTEB Score | Latency (1k tokens) | Cost / 1M tokens | Multilingual |
|---|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | 62.3 | 45ms | $0.02 | ✅ |
| text-embedding-3-large (OpenAI) | 3072 | 64.6 | 78ms | $0.13 | ✅ |
| embed-english-v3.0 (Cohere) | 1024 | 64.5 | 52ms | $0.10 | ❌ |
| embed-multilingual-v3.0 (Cohere) | 1024 | 66.3 | 58ms | $0.10 | ✅ (100+ languages) |
| BAAI/bge-large-en-v1.5 (Open-source) | 1024 | 63.2 | 120ms (CPU) / 12ms (GPU) | $0 (self-hosted) | ❌ |
| text-embedding-ada-002 (OpenAI, deprecated) | 1536 | 60.9 | 68ms | $0.10 | ✅ |
MTEB (Massive Text Embedding Benchmark): aggregated score across 56 retrieval, classification, and clustering datasets. Higher is better.
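Whatever the model, retrieval comes down to cosine similarity between the query vector and chunk vectors; for L2-normalized vectors it reduces to a dot product. A dependency-free sketch:

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; equals the dot product when both vectors are unit-length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

q = [1.0, 0.0, 1.0]
d1 = [1.0, 0.0, 1.0]  # same direction as the query
d2 = [0.0, 1.0, 0.0]  # orthogonal to the query
print(round(cosine_similarity(q, d1), 6))  # 1.0
print(round(cosine_similarity(q, d2), 6))  # 0.0
```

This is why embedding APIs and vector DBs normalize vectors: the dot product becomes a valid similarity score.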
Implementation Code: OpenAI Embeddings
import openai
from typing import List
openai.api_key = "sk-..."
def embed_texts(texts: List[str], model: str = "text-embedding-3-small") -> List[List[float]]:
"""
Generate embeddings for a list of texts.
Args:
        texts: List of strings to embed (each input up to 8191 tokens; up to 2048 inputs per request)
model: Embedding model to use
Returns:
List of vectors (each vector = list of floats)
"""
response = openai.embeddings.create(
input=texts,
model=model
)
return [item.embedding for item in response.data]
# Usage example
chunks = [
"RAG allows injecting up-to-date knowledge into an LLM.",
"Chunking is the critical step of a RAG pipeline.",
"Vector databases store embeddings for ANN search."
]
embeddings = embed_texts(chunks)
print(f"Generated {len(embeddings)} vectors of dimension {len(embeddings[0])}")
# Output: Generated 3 vectors of dimension 1536
Implementation Code: Self-Hosted with Sentence Transformers
from sentence_transformers import SentenceTransformer
from typing import List
import numpy as np
# Load model (once at startup)
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
def embed_texts(texts: List[str]) -> np.ndarray:
"""
Generate embeddings with an open-source self-hosted model.
Args:
texts: List of strings to embed
Returns:
Numpy array of shape (len(texts), 1024)
"""
    # normalize_embeddings=True returns unit-length vectors (L2 norm = 1),
    # so cosine similarity reduces to a dot product
embeddings = model.encode(
texts,
normalize_embeddings=True,
show_progress_bar=False
)
return embeddings
# Example with batch processing to optimize throughput
chunks = ["..." for _ in range(1000)] # 1000 chunks
# GPU: 1000 chunks in ~1.2s (batch_size=32)
# CPU: 1000 chunks in ~12s (batch_size=8)
embeddings = embed_texts(chunks)
print(f"Shape: {embeddings.shape}") # (1000, 1024)
print(f"Type: {type(embeddings)}") # numpy.ndarray
Recommendation: for an MVP or small team, use OpenAI's text-embedding-3-small (5 minute setup, good quality/price ratio). To reduce costs at scale (>10M chunks), switch to a self-hosted model like BAAI/bge on GPU.
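Whether self-hosting pays off is simple arithmetic. A sketch with hypothetical workload numbers (the $0.02/1M price comes from the table above; the GPU figure is an assumption, as a dedicated instance typically runs a few hundred dollars per month):

```python
def embedding_cost_usd(
    n_chunks: int,
    tokens_per_chunk: int,
    reindexes_per_month: int,
    price_per_million_tokens: float = 0.02,  # text-embedding-3-small (verify current pricing)
) -> float:
    """Monthly API embedding cost for a full-corpus re-index schedule."""
    tokens = n_chunks * tokens_per_chunk * reindexes_per_month
    return tokens / 1_000_000 * price_per_million_tokens

# 50M chunks, ~500 tokens each, fully re-embedded weekly (hypothetical workload)
api_cost = embedding_cost_usd(50_000_000, 500, 4)
print(f"${api_cost:,.0f}/month")  # $2,000/month
```

At that volume a self-hosted GPU undercuts the API; at MVP volumes the API is almost always cheaper once engineering time is counted.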
Vector Databases: Pinecone vs Qdrant vs Weaviate
The choice of vector database depends on your scale, budget, and tolerance for infrastructure management.
Functional and Performance Comparison
| Criterion | Pinecone | Qdrant | Weaviate | pgvector (PostgreSQL) |
|---|---|---|---|---|
| Deployment | Serverless (managed) | Docker / K8s / managed | Docker / K8s / managed | PostgreSQL extension |
| Latency (p95, 1M vectors) | 18ms | 12ms | 15ms | 45ms |
| Cost (1M vectors, 1536 dim) | $70/month | $25/month (self-hosted) | $30/month (self-hosted) | $0 (if existing PostgreSQL) |
| Max scale (vectors) | Several billions | Several billions | Several billions | ~10M (degraded performance after) |
| Metadata filtering | ✅ (limited) | ✅ (very flexible) | ✅ (GraphQL) | ✅ (native SQL) |
| Hybrid search (sparse + dense) | ❌ | ✅ | ✅ | ❌ |
| Setup time | 5 min | 30 min (Docker) | 30 min (Docker) | 10 min (extension) |
Code: Pinecone (Serverless)
from pinecone import Pinecone, ServerlessSpec
import openai
# 1. Initialization
pc = Pinecone(api_key="pcsk_...")
openai.api_key = "sk-..."
# 2. Create index (once)
index_name = "rag-production"
if index_name not in pc.list_indexes().names():
pc.create_index(
name=index_name,
dimension=1536, # Vector dimension (text-embedding-3-small)
metric="cosine", # Cosine similarity
spec=ServerlessSpec(
cloud="aws",
region="us-east-1"
)
)
index = pc.Index(index_name)
# 3. Insert vectors
chunks = ["Chunk 1 content...", "Chunk 2 content..."]
embeddings_response = openai.embeddings.create(
input=chunks,
model="text-embedding-3-small"
)
embeddings = [item.embedding for item in embeddings_response.data]
# Upsert (insert or update)
vectors_to_upsert = [
{
"id": f"chunk-{i}",
"values": embedding,
"metadata": {
"text": chunk,
"source": "documentation.md",
"timestamp": "2026-04-02"
}
}
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
]
index.upsert(vectors=vectors_to_upsert)
# 4. Similarity search
query = "How does chunking work?"
query_embedding_response = openai.embeddings.create(
input=[query],
model="text-embedding-3-small"
)
query_embedding = query_embedding_response.data[0].embedding
results = index.query(
vector=query_embedding,
top_k=5, # Retrieve 5 closest chunks
include_metadata=True # Include metadata (text, source)
)
for match in results.matches:
print(f"Score: {match.score:.4f}")
print(f"Text: {match.metadata['text']}")
print(f"Source: {match.metadata['source']}\n")
Code: Qdrant (Self-Hosted)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import openai
# 1. Connect to Qdrant (local Docker or cloud)
client = QdrantClient(url="http://localhost:6333") # or cloud URL
openai.api_key = "sk-..."
# 2. Create collection (once)
collection_name = "rag_production"
client.recreate_collection(
collection_name=collection_name,
vectors_config=VectorParams(
size=1536, # Dimension
distance=Distance.COSINE
)
)
# 3. Insert vectors
chunks = ["Chunk 1...", "Chunk 2..."]
embeddings_response = openai.embeddings.create(
input=chunks,
model="text-embedding-3-small"
)
embeddings = [item.embedding for item in embeddings_response.data]
points = [
PointStruct(
id=i,
vector=embedding,
payload={
"text": chunk,
"source": "documentation.md",
"category": "technical"
}
)
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
]
client.upsert(collection_name=collection_name, points=points)
# 4. Search with metadata filtering
query = "How does chunking work?"
query_embedding = openai.embeddings.create(
input=[query],
model="text-embedding-3-small"
).data[0].embedding
from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name=collection_name,
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(  # filter on metadata
        must=[FieldCondition(key="category", match=MatchValue(value="technical"))]
    )
)
for hit in results:
print(f"Score: {hit.score:.4f}")
print(f"Text: {hit.payload['text']}")
print(f"Source: {hit.payload['source']}\n")
Benchmark: Query Latency (p95)
# Conditions:
# - 1 million vectors (1536 dimensions)
# - top_k = 5
# - p95 measurement over 10,000 queries
Pinecone Serverless (us-east-1): 18ms
Qdrant (self-hosted, 4 vCPU): 12ms
Weaviate (self-hosted, 4 vCPU): 15ms
pgvector (PostgreSQL 15, HNSW): 45ms
# For 10M vectors:
Pinecone: 22ms
Qdrant: 16ms
Weaviate: 19ms
pgvector: 340ms (not recommended at this scale)
Recommendation: Pinecone for MVP and teams without DevOps. Qdrant for production if you want the best performance/price ratio. pgvector if you already have PostgreSQL and <1M vectors.
Reranking and Hybrid Search
Pure vector search (dense retrieval) is not always optimal. Two advanced patterns significantly improve recall.
Pattern 1: Reranking with Cross-Encoder
After the vector search, a cross-encoder model re-scores each (query, document) pair to reorder the results. This is more precise than cosine similarity alone, but slower.
from sentence_transformers import CrossEncoder
import openai
from qdrant_client import QdrantClient
# 1. Initial vector search (top_k = 20 instead of 5)
query = "What is the difference between RAG and fine-tuning?"
query_embedding = openai.embeddings.create(
input=[query],
model="text-embedding-3-small"
).data[0].embedding
client = QdrantClient(url="http://localhost:6333")
candidates = client.search(
collection_name="rag_production",
query_vector=query_embedding,
limit=20 # Retrieve more candidates
)
# 2. Reranking with Cross-Encoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Create (query, document) pairs
pairs = [[query, hit.payload['text']] for hit in candidates]
# Compute relevance scores (raw logits from the cross-encoder; higher = more relevant)
rerank_scores = reranker.predict(pairs)
# Sort by descending score
ranked_results = sorted(
zip(candidates, rerank_scores),
key=lambda x: x[1],
reverse=True
)
# Keep top 5 after reranking
top_5 = ranked_results[:5]
for hit, score in top_5:
print(f"Rerank Score: {score:.4f}")
print(f"Text: {hit.payload['text']}\n")
# Impact on Recall@5:
# Without reranking: 84%
# With reranking: 92%
# Trade-off: +60ms latency
Pattern 2: Hybrid Search (BM25 + Vector)
Combines lexical search (BM25, based on word frequency) with semantic search (vectors). Particularly effective when the query contains precise keywords (proper nouns, acronyms, technical terms).
from qdrant_client import QdrantClient, models
client = QdrantClient(url="http://localhost:6333")
# Qdrant natively supports hybrid search
# Need to enable sparse indexing when creating the collection
collection_name = "rag_hybrid"
client.recreate_collection(
collection_name=collection_name,
vectors_config={
"dense": models.VectorParams(size=1536, distance=models.Distance.COSINE)
},
sparse_vectors_config={
"sparse": models.SparseVectorParams()
}
)
# Insert with both sparse (BM25) and dense (embeddings) vectors
from qdrant_client.models import SparseVector
points = [
models.PointStruct(
id=0,
vector={
"dense": embedding_dense, # Classic embedding vector
"sparse": SparseVector(
indices=[45, 128, 3421], # IDs of present tokens
values=[0.8, 0.6, 0.4] # TF-IDF or BM25 weights
)
},
payload={"text": chunk_text}
)
]
client.upsert(collection_name=collection_name, points=points)
# Hybrid search via the Query API (qdrant-client >= 1.10): run both
# searches as prefetches, then fuse the rankings with Reciprocal Rank Fusion
results = client.query_points(
    collection_name=collection_name,
    prefetch=[
        models.Prefetch(query=query_embedding, using="dense", limit=20),
        models.Prefetch(query=query_sparse_vector, using="sparse", limit=20),  # a models.SparseVector
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # Qdrant merges the rankings
    limit=5,
)
Recall Impact: hybrid search improves recall by 8-12% on queries containing proper nouns or acronyms, at the cost of increased complexity.
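The fusion Qdrant performs can be reproduced standalone. A minimal sketch of Reciprocal Rank Fusion over two rankings (k=60 is the conventional smoothing constant; each document scores the sum of 1/(k + rank) across lists):

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of doc IDs into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc-a", "doc-b", "doc-c"]   # vector search ranking
sparse = ["doc-c", "doc-a", "doc-d"]  # BM25 ranking
print(reciprocal_rank_fusion([dense, sparse]))  # ['doc-a', 'doc-c', 'doc-b', 'doc-d']
```

Documents appearing high in both lists (here doc-a) dominate, which is exactly the behavior you want for keyword-heavy queries.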
Production Monitoring and Metrics
A production RAG system requires continuous monitoring of retrieval quality and latency. Here are the key metrics to track.
Retrieval Quality Metrics
- Recall@k: proportion of relevant documents retrieved in top k results. Aim for >90% at k=5.
- MRR (Mean Reciprocal Rank): average position of first relevant document. Aim for >0.8.
- NDCG@k (Normalized Discounted Cumulative Gain): measures ranking quality considering position. Aim for >0.85.
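Recall@k and MRR have reference implementations in the golden-test code further down; NDCG@k can be sketched similarly. This version assumes binary relevance (`ndcg_at_k` is an illustrative helper, not a library function):

```python
import math
from typing import List

def ndcg_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """NDCG@k with binary relevance: gain 1 if the doc is relevant, else 0."""
    relevant = set(relevant_ids)
    # DCG: each relevant hit contributes 1/log2(rank + 1)
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(retrieved_ids[:k], start=1)
        if doc_id in relevant
    )
    # Ideal DCG: all relevant docs ranked first
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

print(round(ndcg_at_k(["d1", "d9", "d2"], ["d1", "d2"], k=3), 2))  # 0.92
```

Unlike Recall@k, NDCG penalizes relevant documents ranked lower, which is why it is the better metric once you add reranking.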
Latency Metrics
- Embedding latency (p95): time to embed query. Target: <100ms.
- Vector search latency (p95): search time in vector database. Target: <50ms.
- Reranking latency (p95): reranking time (if enabled). Target: <200ms.
- End-to-end latency (p95): total retrieval time. Target: <300ms.
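If you need a quick p95 outside your monitoring backend, the nearest-rank method is a few lines (illustrative only; Datadog, Grafana, and friends compute percentiles for you):

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value covering at least p% of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = [42, 38, 51, 47, 320, 44, 49, 40, 45, 46]  # e.g. vector search samples
print(f"p95: {percentile(latencies_ms, 95)}ms")  # p95: 320ms
```

Note how a single slow outlier dominates the p95 while barely moving the average, which is why latency targets are stated as percentiles.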
Code: Tracking with Golden Test Set
import json
from typing import List, Dict
from dataclasses import dataclass
@dataclass
class GoldenTestCase:
query: str
relevant_doc_ids: List[str] # IDs of relevant documents
# Golden test set: questions with expected answers
golden_tests = [
GoldenTestCase(
query="What is the difference between RAG and fine-tuning?",
relevant_doc_ids=["doc-42", "doc-128", "doc-391"]
),
# ... 100+ test cases
]
def calculate_recall_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
"""
Calculate Recall@k: proportion of relevant documents retrieved in top k.
"""
retrieved_k = set(retrieved_ids[:k])
relevant_set = set(relevant_ids)
if len(relevant_set) == 0:
return 0.0
return len(retrieved_k & relevant_set) / len(relevant_set)
def calculate_mrr(retrieved_ids: List[str], relevant_ids: List[str]) -> float:
"""
Calculate MRR: inverse of the rank of the first relevant document.
"""
for i, doc_id in enumerate(retrieved_ids, start=1):
if doc_id in relevant_ids:
return 1.0 / i
return 0.0
# Evaluation on golden test set
recalls = []
mrrs = []
for test in golden_tests:
# Retrieval with your RAG system
results = rag_system.retrieve(test.query, top_k=10)
retrieved_ids = [r.id for r in results]
recall_5 = calculate_recall_at_k(retrieved_ids, test.relevant_doc_ids, k=5)
mrr = calculate_mrr(retrieved_ids, test.relevant_doc_ids)
recalls.append(recall_5)
mrrs.append(mrr)
# Aggregated metrics
print(f"Recall@5: {sum(recalls) / len(recalls):.2%}")
print(f"MRR: {sum(mrrs) / len(mrrs):.3f}")
# Alerts:
# - If Recall@5 < 85%: investigate degradation
# - If MRR < 0.75: good documents not in first position
Code: Latency Tracking with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import time
from typing import Dict, List
# Setup OpenTelemetry (export to Datadog, Grafana, etc.)
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
def retrieve_with_tracing(query: str) -> List[Dict]:
    with tracer.start_as_current_span("rag.retrieve") as span:
        span.set_attribute("query", query)
        # 1. Embedding
        with tracer.start_as_current_span("rag.embed_query") as embed_span:
            start = time.time()
            query_embedding = embed_query(query)
            embed_span.set_attribute("latency_ms", (time.time() - start) * 1000)
        # 2. Vector search
        with tracer.start_as_current_span("rag.vector_search") as search_span:
            start = time.time()
            candidates = vector_db.search(query_embedding, top_k=20)
            search_span.set_attribute("latency_ms", (time.time() - start) * 1000)
            search_span.set_attribute("candidates_count", len(candidates))
        # 3. Reranking
        with tracer.start_as_current_span("rag.rerank") as rerank_span:
            start = time.time()
            results = rerank(query, candidates, top_k=5)
            rerank_span.set_attribute("latency_ms", (time.time() - start) * 1000)
        return results
# Spans are automatically exported to your monitoring backend
# You can then create dashboards with p50, p95, p99 of each step
Reference Architecture: RAG Production 2026
Here's a complete architecture for a production RAG system, with redundancy, monitoring, and cost management.
┌────────────────────────────────────────────────────────────────────┐
│ PRODUCTION RAG ARCHITECTURE │
└────────────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ User Query │
└────────┬────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ API Gateway (FastAPI) │
│ - Rate limiting (100 req/min per user) │
│ - Authentication (JWT) │
│ - Request validation │
└────────┬────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ RAG Orchestrator Service │
│ │
│ 1. Query Analysis │
│ - Intent detection (factual vs conversational) │
│ - Language detection │
│ │
│ 2. Retrieval Pipeline │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ Embed │───▶│ Vector Search│───▶│ Rerank │ │
│ │ Query │ │ (Qdrant) │ │(optional)│ │
│ └─────────────┘ └──────────────┘ └──────────┘ │
│ │
│ 3. Context Construction │
│ - Top 5 chunks → formatted prompt │
│ - Add metadata (source, timestamp) │
│ │
│ 4. LLM Generation │
│ - Claude 4.5 Sonnet (200k context) │
│ - Streaming response │
│ │
└────────┬────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE │
│ │
│ Vector DB: Qdrant (6 vCPU, 16GB RAM, 100M vectors) │
│ Embedding: OpenAI text-embedding-3-small (API) │
│ LLM: Claude 4.5 Sonnet (Anthropic API) │
│ Cache: Redis (query embeddings, LRU, 1GB) │
│ Monitoring: Datadog (traces, metrics, logs) │
│ Alerting: PagerDuty (latency > 500ms, recall < 85%) │
│ │
└─────────────────────────────────────────────────────────────────────┘
OFFLINE INDEXING PIPELINE (runs every 6 hours):
Documents (S3) ──▶ Chunking ──▶ Embedding ──▶ Qdrant Upsert
(LangChain) (batch 100) (atomic swap)
COST BREAKDOWN (10M queries/month, 50M vectors):
- Qdrant (self-hosted AWS EC2): $120/month
- OpenAI embeddings (queries only): $200/month
- Claude API (generation): $3,000/month
- Infrastructure (EC2, S3, Redis): $250/month
─────────────────────────────────────────────────
TOTAL: $3,570/month
Cost per query: $0.00036
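The cost-per-query line is just the total divided by monthly volume; a tiny helper to redo the math with your own figures:

```python
from typing import Dict

def cost_per_query(monthly_costs: Dict[str, float], queries_per_month: int) -> float:
    """Total monthly spend divided by query volume."""
    return sum(monthly_costs.values()) / queries_per_month

costs = {  # figures from the breakdown above (USD/month)
    "qdrant_ec2": 120,
    "openai_embeddings": 200,
    "claude_api": 3000,
    "infra": 250,
}
print(f"${cost_per_query(costs, 10_000_000):.5f} per query")  # $0.00036 per query
```

Note that generation dominates (~84% of the total), so caching and prompt-size discipline on the LLM side move the needle far more than vector DB savings.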
Production Deployment Checklist
Before deploying your RAG system to production, validate these points.
- ✅ Golden test set: at least 50 questions with expected answers
- ✅ Recall@5 > 85% measured on golden test set
- ✅ p95 Latency < 500ms end-to-end (embedding + search + rerank + LLM)
- ✅ Active monitoring: traces, recall metrics, degradation alerts
- ✅ Rate limiting: protection against abuse (100 req/min per user)
- ✅ Error handling: retry logic on API calls (embedding, LLM), fallback if vector DB down
- ✅ Cache: Redis for frequent queries (reduces embedding costs + latency)
- ✅ Embedding versioning: tag each vector with model version (enables migration)
- ✅ Data backup: daily snapshots of vector DB
- ✅ Documentation: architecture, incident runbook, recall degradation playbook
Resources and Training
To go further and implement RAG in your projects, our Claude API for Developers course covers advanced RAG patterns in depth, LangChain integration, and production monitoring strategies. 3-day course, OPCO eligible in France (potential out-of-pocket cost: €0).
We also cover AI agents that orchestrate multiple RAG calls in our AI Agents course.
Frequently Asked Questions
What's the difference between RAG and fine-tuning?
Fine-tuning modifies the model's weights to teach it new knowledge (long process, expensive, risk of catastrophic forgetting). RAG keeps the model intact and injects relevant knowledge at inference time via document retrieval. RAG is more flexible, less expensive, and allows real-time knowledge updates without retraining.
Which embedding model should I choose for production in 2026?
For most cases: OpenAI's text-embedding-3-small (good performance/cost trade-off). For critical multilingual applications: Cohere embed-multilingual-v3.0. To reduce costs and maintain control: BAAI/bge-large-en-v1.5 self-hosted. Avoid ada-002 (deprecated). Storage and search costs scale with vector dimension, so prefer the smallest dimension that meets your quality bar (the text-embedding-3 models can also be truncated via the API's `dimensions` parameter).
Pinecone, Qdrant, or Weaviate for my vector database?
Pinecone if you want serverless with no infrastructure management (best for MVPs and small teams). Qdrant if you want the best performance/price ratio and accept managing hosting (Docker or Kubernetes). Weaviate if you need a knowledge graph in addition to vectors. For < 100k vectors: PostgreSQL with pgvector is sufficient and reduces your stack.
How to measure retrieval quality in production?
Three key metrics: (1) Recall@k: proportion of relevant documents retrieved in the top k results. Aim for >90% at k=5. (2) MRR (Mean Reciprocal Rank): position of the first relevant result. Aim for >0.8. (3) p95 Latency: retrieval time at 95th percentile. Aim for <200ms for good UX. Track these metrics continuously with golden test sets and alert on degradations.