A mid-sized law firm manages 4,000 contracts — supplier agreements, NDAs, SLAs, and lease renewals. When a client asks "does our Azure agreement cap liability at 2× annual fees?", the answer is buried in clause 14.3 of a 90-page PDF. The analyst spends 45 minutes searching. With a local RAG system, the same query returns a cited answer in under 3 seconds — at zero marginal cost, with no contract data leaving the firm's servers.
This tutorial builds exactly that system: a Legal Contract Q&A assistant powered by LangChain, Ollama (local LLM inference), and ChromaDB (open-source vector database). All components are free and self-hosted, and because no contract data ever leaves your infrastructure, GDPR compliance becomes far simpler.
Architecture: Three Components, Zero Vendor Lock-in
The system follows the standard RAG pattern with two phases: offline indexing ingests documents once and re-runs whenever they change; online retrieval answers queries in real time.
┌──────────────────────────────────────────────────┐
│ LEGAL CONTRACT RAG SYSTEM │
├─────────────┬──────────────┬────────────────────┤
│ Ollama │ ChromaDB │ LangChain │
│ (local LLM) │ (vector DB) │ (orchestration) │
│ llama3.1:8b │ docker mode │ retriever + chain │
│ nomic-embed │ ~2 GB / 4k │ prompt template │
└─────────────┴──────────────┴────────────────────┘
INDEXING (once / per update):
PDF → chunks → embeddings → ChromaDB
QUERY (~2–4 s per request):
Question → embed → retrieve → LLM → answer
Prerequisites and Environment Setup
- Python 3.11+ and pip
- Docker Desktop (for ChromaDB server mode) — or 8 GB RAM for embedded mode
- Ollama installed from ollama.com — runs on macOS, Linux, Windows
- 16 GB RAM recommended for llama3.1:8b; use llama3.2:3b on machines with 8 GB
- ~10 GB disk space for models
# Install Python dependencies
pip install langchain langchain-community langchain-chroma \
chromadb ollama pypdf ragas datasets python-dotenv
# .env.example — copy to .env and adjust
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_LLM_MODEL=llama3.1:8b
OLLAMA_EMBED_MODEL=nomic-embed-text
CHROMA_HOST=localhost
CHROMA_PORT=8000
CHROMA_COLLECTION=legal_contracts
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
RETRIEVER_K=5
# docker-compose.yml — ChromaDB server
version: "3.9"
services:
chromadb:
image: chromadb/chroma:0.6.3
ports:
- "8000:8000"
volumes:
- ./chroma_data:/chroma/chroma
environment:
- CHROMA_SERVER_AUTH_PROVIDER=none
- ANONYMIZED_TELEMETRY=false
restart: unless-stopped
Embedded vs server mode: For solo development, skip Docker entirely — ChromaDB runs embedded in your Python process. Replace chromadb.HttpClient(...) with chromadb.PersistentClient(path="./chroma_data"). Switch to server mode when you need multiple processes (e.g., an API server + a background ingestion job) to access the same collection.
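For reference, a minimal sketch of the embedded-mode setup, assuming the same ./chroma_data directory and collection name as the server-mode configuration above:

# embedded_mode.py: sketch of ChromaDB running inside the Python process, no Docker needed
import chromadb

# Drop-in replacement for chromadb.HttpClient(...) in ingest.py and rag_chain.py below;
# vectors are persisted to ./chroma_data on local disk.
chroma_client = chromadb.PersistentClient(path="./chroma_data")

collection = chroma_client.get_or_create_collection("legal_contracts")
print(collection.count(), "chunks stored so far")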
Step 1: Pull Ollama Models
You need two models: an embedding model to vectorize documents and queries, and a chat LLM to generate answers. Both run locally after a one-time download.
# Pull both models — cached after first download
ollama pull nomic-embed-text # 274 MB embedding model
ollama pull llama3.1:8b      # 4.7 GB LLM (use llama3.2:3b if RAM < 12 GB)
# Verify
ollama list
# Quick smoke test
ollama run llama3.1:8b "What is a force majeure clause? One sentence."
# → A force majeure clause excuses a party from contractual obligations
# due to extraordinary events beyond their control.
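Optionally, run the same check from Python using the LangChain wrappers the pipeline relies on. This preflight.py file is not part of the pipeline, just a sketch of a sanity check:

# preflight.py: verify both models respond from Python before ingesting anything
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

emb = OllamaEmbeddings(model="nomic-embed-text")
vector = emb.embed_query("limitation of liability")
print("embedding dimensions:", len(vector))  # nomic-embed-text returns 768-dimensional vectors

llm = ChatOllama(model="llama3.1:8b", temperature=0)
print(llm.invoke("Define 'indemnification' in one sentence.").content)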
Step 2: Document Ingestion Pipeline
The ingestion script loads PDF contracts, splits them into overlapping chunks, generates embeddings, and upserts into ChromaDB. Document IDs are derived from file path + chunk index — re-running on the same file never creates duplicates.
# ingest.py
import os, hashlib
from pathlib import Path
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma
import chromadb
load_dotenv()
embeddings = OllamaEmbeddings(
model=os.getenv("OLLAMA_EMBED_MODEL", "nomic-embed-text"),
base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
)
chroma_client = chromadb.HttpClient(
host=os.getenv("CHROMA_HOST", "localhost"),
port=int(os.getenv("CHROMA_PORT", "8000")),
)
vector_store = Chroma(
client=chroma_client,
collection_name=os.getenv("CHROMA_COLLECTION", "legal_contracts"),
embedding_function=embeddings,
)
splitter = RecursiveCharacterTextSplitter(
chunk_size=int(os.getenv("CHUNK_SIZE", "1000")),
chunk_overlap=int(os.getenv("CHUNK_OVERLAP", "200")),
separators=["\n\n", "\n", ". ", " "], # respects paragraph structure
)
def stable_id(path: str, idx: int) -> str:
return hashlib.sha256(f"{path}::chunk_{idx}".encode()).hexdigest()[:16]
def ingest(pdf_path: str) -> int:
chunks = splitter.split_documents(PyPDFLoader(pdf_path).load())
for i, chunk in enumerate(chunks):
chunk.metadata.update({
"source_file": Path(pdf_path).name,
"chunk_index": i,
"doc_id": stable_id(pdf_path, i),
})
vector_store.add_documents(chunks, ids=[c.metadata["doc_id"] for c in chunks])
return len(chunks)
if __name__ == "__main__":
import sys
pdfs = list(Path(sys.argv[1] if len(sys.argv) > 1 else "./contracts").glob("**/*.pdf"))
total = sum(ingest(str(p)) for p in pdfs)
print(f"Stored {total} chunks from {len(pdfs)} contracts")
# Run with:
# docker compose up -d
# python ingest.py ./contracts
# → Stored 110 chunks from 2 contracts
Step 3: RAG Chain with LangChain
The chain embeds the query, retrieves the top-k chunks, and passes them with a citation-enforcing prompt to the local LLM. temperature=0 matters for legal use: it removes sampling randomness, so the same question against the same retrieved context returns the same answer.
# rag_chain.py
import os
from dotenv import load_dotenv
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import chromadb
load_dotenv()
OLLAMA_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
llm = ChatOllama(
model=os.getenv("OLLAMA_LLM_MODEL", "llama3.2:8b"),
base_url=OLLAMA_URL,
temperature=0, # deterministic for legal answers
)
embeddings = OllamaEmbeddings(
model=os.getenv("OLLAMA_EMBED_MODEL", "nomic-embed-text"),
base_url=OLLAMA_URL,
)
chroma_client = chromadb.HttpClient(
host=os.getenv("CHROMA_HOST", "localhost"),
port=int(os.getenv("CHROMA_PORT", "8000")),
)
vector_store = Chroma(
client=chroma_client,
collection_name=os.getenv("CHROMA_COLLECTION", "legal_contracts"),
embedding_function=embeddings,
)
retriever = vector_store.as_retriever(
search_kwargs={"k": int(os.getenv("RETRIEVER_K", "5"))}
)
SYSTEM = """You are a legal contract analyst. Answer based ONLY on the
excerpts provided. If the answer is not in the excerpts, say:
"I cannot find this information in the provided contracts."
Always cite the source file and clause when possible.
Contract excerpts:
{context}"""
prompt = ChatPromptTemplate.from_messages([
("system", SYSTEM),
("human", "{question}"),
])
def format_docs(docs):
return "\n\n---\n\n".join(
f"[Source: {d.metadata.get('source_file', 'unknown')}]\n{d.page_content}"
for d in docs
)
chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
if __name__ == "__main__":
print("Legal Contract Assistant — type 'exit' to quit\n")
while True:
q = input("Question: ").strip()
if q.lower() in ("exit", "quit"):
break
print("\nAnswer:", chain.invoke(q), "\n")
Case Study: Legal Contract Assistant in Action
A law firm with 4,000 contract PDFs (~2.1 GB total) runs this system on a Mac Mini M4 (no discrete GPU). Ingestion: 42 minutes. Average query latency: 2.8 s. Three real query types demonstrate the value:
- Clause lookup: "What is the notice period in our AWS Enterprise Agreement?" → cited answer with clause reference in 2.1 s
- Cross-contract search: "Which vendor agreements allow subprocessors without prior written consent?" → retrieved 3 relevant contracts, synthesized answer in 4.4 s
- Risk flagging: "Are there contracts with uncapped liability exposure?" → scanned NDA chunks, returned 2 flagged contracts with exact clause references
Scope retrieval with metadata filters: ChromaDB's metadata filters use exact-match operators such as $eq and $in rather than substring matching, so tag each chunk with a contract_type at ingestion (see the sketch below) and filter on it: retriever = vector_store.as_retriever(search_kwargs={"k": 5, "filter": {"contract_type": "nda"}}). Use this to restrict searches to specific contract categories and eliminate cross-pollination between unrelated agreement types.
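A minimal sketch of the ingestion-side tagging, assuming you pass the category explicitly per file or folder; the ingest_tagged name and the category values are illustrative:

# tag_and_filter.py: tag chunks with a contract category, then filter retrieval on it
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader
from ingest import splitter, vector_store, stable_id  # reuse the objects defined in ingest.py

def ingest_tagged(pdf_path: str, contract_type: str) -> int:
    """Same as ingest(), plus a contract_type tag on every chunk."""
    chunks = splitter.split_documents(PyPDFLoader(pdf_path).load())
    for i, chunk in enumerate(chunks):
        chunk.metadata.update({
            "source_file": Path(pdf_path).name,
            "chunk_index": i,
            "doc_id": stable_id(pdf_path, i),
            "contract_type": contract_type,  # e.g. "nda", "sla", "lease"
        })
    vector_store.add_documents(chunks, ids=[c.metadata["doc_id"] for c in chunks])
    return len(chunks)

# Query time: a retriever that only sees NDAs
nda_retriever = vector_store.as_retriever(
    search_kwargs={"k": 5, "filter": {"contract_type": "nda"}}
)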
Step 4: Evaluate with RAGAS
Before deploying to users, measure quality with three RAGAS metrics: Faithfulness (does the answer stick to retrieved context?), Answer Relevancy (does it address the question?), and Context Precision (are the retrieved chunks relevant?).
# evaluate.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from rag_chain import chain, retriever
eval_set = [
{
"question": "What is the liability cap in the Azure Master Agreement?",
"ground_truth": "Microsoft caps aggregate liability at amounts paid in the preceding 12 months, maximum USD 500,000.",
},
{
"question": "Which contracts allow assignment to affiliates without consent?",
"ground_truth": "The Stripe and Twilio agreements allow assignment to affiliates without prior consent.",
},
]
rows = []
for item in eval_set:
docs = retriever.invoke(item["question"])
rows.append({
"question": item["question"],
"answer": chain.invoke(item["question"]),
"contexts": [d.page_content for d in docs],
"ground_truth": item["ground_truth"],
})
scores = evaluate(Dataset.from_list(rows), metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
# {'faithfulness': 0.87, 'answer_relevancy': 0.83, 'context_precision': 0.79}
#
# faithfulness 0.87 → answers well-grounded in context (good for legal use)
# context_precision 0.79 → some off-topic chunks retrieved; try smaller chunk_size
RAGAS uses an LLM as judge, by default an OpenAI model, which requires an API key and costs money per run. To keep evaluation free, configure it to use your local Ollama model via LangchainLLMWrapper, as sketched below. Note that local models (especially smaller ones) are less reliable as evaluators; use llama3.3:70b if you have the hardware for it.
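A sketch of that swap, assuming the installed ragas version exposes evaluate(llm=..., embeddings=...) and the wrapper classes shown (check your version's docs if the import paths differ):

# evaluate_local.py: local Ollama model as the RAGAS judge instead of OpenAI
from datasets import Dataset
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from rag_chain import chain, retriever, llm, embeddings

# One sample row, built the same way as in evaluate.py above
question = "What is the liability cap in the Azure Master Agreement?"
docs = retriever.invoke(question)
rows = [{
    "question": question,
    "answer": chain.invoke(question),
    "contexts": [d.page_content for d in docs],
    "ground_truth": "Microsoft caps aggregate liability at amounts paid in the preceding 12 months, maximum USD 500,000.",
}]

scores = evaluate(
    Dataset.from_list(rows),
    metrics=[faithfulness, answer_relevancy, context_precision],
    llm=LangchainLLMWrapper(llm),                       # local judge
    embeddings=LangchainEmbeddingsWrapper(embeddings),  # local embeddings for answer_relevancy
)
print(scores)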
Production Checklist
- Idempotent ingestion: stable chunk IDs (path + index hash) ensure re-runs never duplicate chunks — safe to schedule as a nightly cron job
- Metadata tagging: add contract_type, counterparty, effective_date to each chunk for filtered retrieval without full-corpus scans
- Access control: ChromaDB has no built-in auth — front it with a FastAPI proxy + JWT tokens so users only query their authorized collections (see the sketch after this list)
- Backup: volume-mount ./chroma_data and run daily S3 sync — 4,000 vectorized contracts fit in ~2 GB
- Quality monitoring: log question + retrieved chunks + answer triplets; run RAGAS weekly on a fixed benchmark set to catch corpus drift early
- Model upgrades: swap models by changing OLLAMA_LLM_MODEL in .env — no code changes needed
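A minimal sketch of that access-control proxy, assuming FastAPI and PyJWT (pip install fastapi uvicorn pyjwt); the JWT_SECRET variable and the /ask route are illustrative, not part of the pipeline above:

# api_proxy.py: authenticate users before they can reach the RAG chain
import os

import jwt  # PyJWT
from fastapi import Depends, FastAPI, Header, HTTPException
from rag_chain import chain

app = FastAPI()
JWT_SECRET = os.getenv("JWT_SECRET", "change-me")

def current_user(authorization: str = Header(...)) -> dict:
    """Decode and verify the Bearer token; reject the request otherwise."""
    try:
        token = authorization.removeprefix("Bearer ").strip()
        return jwt.decode(token, JWT_SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or missing token")

@app.post("/ask")
def ask(body: dict, user: dict = Depends(current_user)) -> dict:
    # In a multi-tenant setup, pick the collection from the token claims (e.g. user["collection"])
    return {"answer": chain.invoke(body["question"])}

# Run with: uvicorn api_proxy:app --port 8080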
What's Next
This tutorial gives you a working baseline. For production hardening, the next steps are hybrid BM25+vector search (improves Context Recall by ~15% on technical documents), cross-encoder reranking (better precision on ambiguous queries), and parent-child chunking (better answer completeness on long structured contracts). All three are covered in the Advanced RAG Implementation course.
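As a taste of the first item, here is a minimal hybrid-retrieval sketch built on LangChain's BM25Retriever and EnsembleRetriever. It assumes pip install rank_bm25, and the 0.4/0.6 weights are a starting point to tune against your RAGAS benchmark, not a recommendation:

# hybrid_retriever.py: keyword (BM25) + vector retrieval in one ensemble
from pathlib import Path

from langchain.retrievers import EnsembleRetriever
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.retrievers import BM25Retriever
from ingest import splitter
from rag_chain import vector_store

# BM25 works on raw text in memory, so re-split the corpus (or cache the chunks at ingest time)
chunks = []
for pdf in Path("./contracts").glob("**/*.pdf"):
    chunks.extend(splitter.split_documents(PyPDFLoader(str(pdf)).load()))

bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 5

hybrid = EnsembleRetriever(
    retrievers=[bm25, vector_store.as_retriever(search_kwargs={"k": 5})],
    weights=[0.4, 0.6],  # keyword vs vector weighting; tune on your benchmark
)
# Drop-in replacement: pass `hybrid` instead of `retriever` when building the chain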