Talki Academy
Case Study · CFO / Engineering Lead · 22 min read

Case Study: How a Fintech Cut AI Infrastructure Costs from €50,000 to €5,000/year

In Q4 2025, Nexus Finance (55 employees, anonymized) was spending €4,200/month on LLM APIs — mostly GPT-4 Turbo for document analysis. Three months later, that bill was €415/month. Same volume, same quality SLA, 90.1% lower spend. This article documents every technique applied, with working code and a downloadable cost calculator.

  • Annual spend before: €50,400
  • Annual spend after: €4,980
  • Cost reduction: 90.1%
  • Implementation time: 3 months

The Company: Nexus Finance (Anonymized)

Nexus Finance is a 55-person B2B fintech based in Geneva, serving independent insurance brokers across France, Belgium, and Switzerland. Their core product is a document intelligence platform: brokers upload insurance contracts, and the platform extracts key terms, flags coverage gaps, and generates comparison reports for clients.

By mid-2025, the platform was processing roughly 42,000 documents per month — a mix of contract PDFs (20-80 pages), broker query emails, and structured data exports. Every document touched an LLM at least once; most touched one three to four times.

Profile at the start of the project (October 2025)

  • 55 employees, €7M ARR, Series A
  • 42,000 documents/month processed by LLMs
  • ~1.2M LLM API calls/month (3 calls per document on average)
  • Stack: Python FastAPI backend, PostgreSQL, Redis (existing), GPT-4 Turbo + text-embedding-ada-002
  • AI infrastructure budget: €4,200/month (€50,400/year) — one of the top 3 operating costs

Baseline Spend Breakdown: €4,200/month

Before any optimization, the team ran a full spend audit. This is the single most important step — most teams discover that their intuition about where the money goes is wrong by 40-60%.

Use Case                       | Model                  | Calls/month | Avg tokens         | €/month
Contract analysis (extraction) | GPT-4 Turbo            | 420,000     | 3,800 in / 600 out | €2,080
Coverage gap detection         | GPT-4 Turbo            | 380,000     | 2,200 in / 450 out | €1,100
Document classification        | GPT-4 Turbo            | 210,000     | 800 in / 60 out    | €380
Email summarization            | GPT-4 Turbo            | 85,000      | 1,200 in / 200 out | €320
Embeddings (ada-002)           | text-embedding-ada-002 | 1,200,000   | 1,100 in           | €240
Support chatbot                | GPT-4 Turbo            | 12,000      | 2,400 in / 350 out | €80
Total                          |                        |             |                    | €4,200

Key finding from the audit: document classification was using GPT-4 Turbo for tasks a 7B model handles equally well. 210,000 calls at €380/month for simple three-way classification (is this an auto, health, or liability contract?) — a task Mistral 7B solves at 96% accuracy and 1/40th the cost.

Step 1 — The Spend Audit Script

Before touching any optimization, count your actual tokens. Character-based estimates are typically 25-40% too low. This Python script instruments your OpenAI calls and logs real token usage by use case:

# ai_cost_audit.py — instrument existing OpenAI calls
import json
import functools
from datetime import datetime
from openai import OpenAI

client = OpenAI()
COST_LOG = []

# Current pricing (USD/1M tokens, May 2026)
PRICING = {
    "gpt-4-turbo":         {"input": 10.00, "output": 30.00},
    "gpt-4o":              {"input":  2.50, "output": 10.00},
    "gpt-4o-mini":         {"input":  0.15, "output":  0.60},
    "text-embedding-ada-002": {"input": 0.10, "output": 0.00},
    "claude-sonnet-4-5":   {"input":  3.00, "output": 15.00},
    "claude-haiku-4-5":    {"input":  0.80, "output":  4.00},
}

def tracked_completion(use_case: str):
    """Decorator: wraps any function calling client.chat.completions.create"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            usage = result.usage
            model = result.model
            # The API returns versioned model names (e.g. "gpt-4-turbo-2024-04-09"),
            # so match on prefix rather than exact key
            pricing = next(
                (p for name, p in PRICING.items() if model.startswith(name)),
                {"input": 0, "output": 0},
            )
            cost_usd = (
                usage.prompt_tokens * pricing["input"] / 1_000_000 +
                usage.completion_tokens * pricing["output"] / 1_000_000
            )
            COST_LOG.append({
                "use_case": use_case,
                "model": model,
                "timestamp": datetime.utcnow().isoformat(),
                "prompt_tokens": usage.prompt_tokens,
                "completion_tokens": usage.completion_tokens,
                "cost_usd": round(cost_usd, 6),
            })
            return result
        return wrapper
    return decorator

# Usage: decorate your existing functions
@tracked_completion("contract_analysis")
def analyze_contract(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "Extract key insurance terms..."},
            {"role": "user", "content": text},
        ],
        max_tokens=600,
    )
    return response.choices[0].message.content

# Project monthly spend from the instrumented window:
def generate_report(hours_covered: float = 24) -> None:
    from collections import defaultdict
    by_use_case = defaultdict(lambda: {"calls": 0, "cost_usd": 0})
    for entry in COST_LOG:
        by_use_case[entry["use_case"]]["calls"] += 1
        by_use_case[entry["use_case"]]["cost_usd"] += entry["cost_usd"]

    print(f"{'Use Case':<30} {'Calls':>10} {'Cost/day':>12} {'Proj/mo':>12}")
    print("-" * 68)
    total = 0
    for use_case, data in sorted(by_use_case.items(), key=lambda x: -x[1]["cost_usd"]):
        monthly = data["cost_usd"] * (24 / hours_covered) * 30  # scale window to 30 days
        total += monthly
        print(f"{use_case:<30} {data['calls']:>10,} {data['cost_usd']:>11.2f}$ {monthly:>11.2f}$")
    print(f"\n{'TOTAL PROJECTED/MONTH':<30} {'':>10} {'':>12} {total:>11.2f}$")

# Run in production for 48-72h, then call:
# generate_report(hours_covered=72)

Nexus Finance audit result

After 72h of instrumented production traffic: projected monthly spend was €4,287 — within 2% of the billing dashboard. The audit also revealed that 18% of contract_analysis calls were duplicate documents re-processed after minor edits. That alone was €380/month of avoidable spend.
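Catching those re-submissions does not need an LLM at all: a content hash in front of the API call is enough. Below is a minimal sketch using the Redis instance already in the stack; the key layout, whitespace normalization, and 30-day window are illustrative assumptions, not Nexus Finance's production code.

# dedup_guard.py — sketch: skip LLM calls for already-processed documents
import hashlib
from redis import Redis

redis_client = Redis(host="localhost", port=6379)
DEDUP_TTL = 86400 * 30  # remember processed documents for 30 days (assumption)

def _doc_key(text: str) -> str:
    # Collapse whitespace and case so trivial formatting edits hash identically
    normalized = " ".join(text.split()).lower()
    return f"dedup:{hashlib.sha256(normalized.encode()).hexdigest()}"

def already_processed(text: str) -> str | None:
    """Return the stored analysis if this document was seen before, else None."""
    cached = redis_client.get(_doc_key(text))
    return cached.decode() if cached else None

def mark_processed(text: str, result: str) -> None:
    redis_client.set(_doc_key(text), result, ex=DEDUP_TTL)

Documents edited more substantially fall through to the semantic cache below, which matches by meaning rather than by bytes.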

Step 2 — Semantic Caching with Redis Stack

Semantic caching stores LLM responses indexed by an embedding of the query. When a new query arrives, it is embedded and compared against the cache. If cosine similarity exceeds the threshold (0.92 worked well here), the cached response is returned instantly — zero API cost, ~2ms latency.

Nexus Finance already had Redis in their stack for session management. Upgrading to Redis Stack (free, adds vector search) took one afternoon. The result: 45% cache hit rate, saving €1,890/month.

# semantic_cache.py — production-ready semantic cache
import hashlib
import numpy as np
from redis import Redis
from redis.commands.search.field import VectorField, TextField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
from openai import OpenAI

client = OpenAI()
redis_client = Redis(host="localhost", port=6379, decode_responses=False)

CACHE_INDEX = "semantic_cache"
VECTOR_DIM = 1536  # ada-002 dimension
SIMILARITY_THRESHOLD = 0.92  # tune per use case; 0.90-0.95 is the sweet spot
TTL_SECONDS = 86400 * 7  # cache for 7 days

def create_cache_index():
    """Run once on startup."""
    schema = (
        TextField("use_case"),
        VectorField("embedding", "HNSW", {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIM,
            "DISTANCE_METRIC": "COSINE",
        }),
    )
    try:
        redis_client.ft(CACHE_INDEX).create_index(
            schema,
            definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH)
        )
    except Exception:
        pass  # index already exists

def get_embedding(text: str) -> list[float]:
    """Embed with ada-002 (or swap for self-hosted all-minilm for zero cost)."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=text[:8000])
    return response.data[0].embedding

def cache_lookup(query: str, use_case: str) -> str | None:
    """Return cached response if similarity >= threshold, else None."""
    query_embedding = get_embedding(query)
    q = (
        Query(f"*=>[KNN 1 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("response", "score", "use_case")
        .dialect(2)
    )
    results = redis_client.ft(CACHE_INDEX).search(
        q,
        query_params={"vec": np.array(query_embedding, dtype=np.float32).tobytes()}
    )
    if results.docs:
        doc = results.docs[0]
        # Redis COSINE returns distance (lower = more similar), convert to similarity
        similarity = 1 - float(doc.score)
        if similarity >= SIMILARITY_THRESHOLD and doc.use_case == use_case:
            return doc.response
    return None

def cache_store(query: str, use_case: str, response: str) -> None:
    """Store a query-response pair in the cache."""
    embedding = get_embedding(query)
    key = f"cache:{hashlib.sha256(query.encode()).hexdigest()}"
    mapping = {
        "query": query,
        "use_case": use_case,
        "response": response,
        "embedding": np.array(embedding, dtype=np.float32).tobytes(),
    }
    redis_client.hset(key, mapping=mapping)
    redis_client.expire(key, TTL_SECONDS)

# Usage in your existing pipeline:
def analyze_contract_cached(text: str) -> str:
    cached = cache_lookup(text, "contract_analysis")
    if cached:
        return cached  # 0 API cost, ~2ms

    # Miss: call the API and store result
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "Extract key insurance terms..."},
            {"role": "user", "content": text},
        ],
        max_tokens=600,
    ).choices[0].message.content

    cache_store(text, "contract_analysis", response)
    return response

⚠️ Tune the similarity threshold before deploying

0.92 works for insurance document queries. For creative or open-ended tasks, lower to 0.88. For exact extraction tasks (fill-in-the-blank), raise to 0.96. Test with 200 sample pairs from production traffic before setting the threshold in stone.
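A straightforward way to run that test is a threshold sweep over labeled pairs: for each candidate threshold, measure how often the cache would fire and how often a fired hit would serve the wrong answer. A sketch, where `pairs` (sampled production query pairs labeled same/different answer) and the `embed` callable (e.g. `get_embedding` above) are assumed inputs:

# threshold_sweep.py — sketch: pick SIMILARITY_THRESHOLD from labeled pairs
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sweep(pairs: list[tuple[str, str, bool]], embed) -> None:
    """pairs: (query_a, query_b, same_answer); embed: text -> embedding vector."""
    sims = [(cosine_sim(np.array(embed(a)), np.array(embed(b))), same)
            for a, b, same in pairs]
    for threshold in (0.88, 0.90, 0.92, 0.94, 0.96):
        fired = [same for sim, same in sims if sim >= threshold]
        hit_rate = len(fired) / len(sims)
        false_hit = fired.count(False) / max(len(fired), 1)
        print(f"threshold {threshold}: fires on {hit_rate:.0%} of pairs, "
              f"{false_hit:.0%} of fired hits would be wrong")

Pick the highest threshold that keeps the false-hit rate near zero; raising it further only sacrifices hit rate.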

Step 3 — Model Tiering with LiteLLM

Not all queries need GPT-4. Nexus Finance scored every incoming request on a complexity heuristic, then routed to the appropriate model tier. LiteLLM acts as the unified proxy — your application code calls one endpoint, LiteLLM handles routing, fallbacks, and spend tracking.

Tier     | Model                  | Use cases                                         | % of volume | Cost/1M out tokens
Fast     | Mistral 7B Q4 (Ollama) | Classification, routing, short extractions        | 65%         | €0 (self-hosted)
Standard | Claude Haiku 4.5       | Summarization, medium extraction, email replies   | 25%         | $4/1M
Premium  | Claude Sonnet 4.5      | Complex contract analysis, gap detection, reports | 10%         | $15/1M

# litellm_router.py — intelligent model routing
# Install: pip install litellm
import litellm

# LiteLLM config (also works as litellm_config.yaml for the proxy server)
litellm.set_verbose = False

MODELS = {
    "fast":     "ollama/mistral:7b-instruct-q4_K_M",   # self-hosted
    "standard": "anthropic/claude-haiku-4-5-20251001",
    "premium":  "anthropic/claude-sonnet-4-5",
}

COMPLEXITY_WEIGHTS = {
    "output_length_requested": 2.0,  # >400 tokens → likely complex
    "document_pages": 1.5,           # >5 pages → premium
    "requires_reasoning": 3.0,       # "compare", "analyze tradeoffs", "explain why"
    "structured_output": 0.5,        # JSON extraction → fast model handles fine
}

REASONING_KEYWORDS = {"compare", "analyze", "why", "tradeoffs", "recommend", "risk"}

def score_complexity(prompt: str, max_output_tokens: int, doc_pages: int = 1) -> float:
    score = 0.0
    if max_output_tokens > 400:
        score += COMPLEXITY_WEIGHTS["output_length_requested"]
    if doc_pages > 5:
        score += COMPLEXITY_WEIGHTS["document_pages"]
    if any(kw in prompt.lower() for kw in REASONING_KEYWORDS):
        score += COMPLEXITY_WEIGHTS["requires_reasoning"]
    if "json" in prompt.lower() or "extract" in prompt.lower():
        score -= COMPLEXITY_WEIGHTS["structured_output"]  # simpler task
    return score

def route(prompt: str, max_tokens: int = 200, doc_pages: int = 1) -> str:
    score = score_complexity(prompt, max_tokens, doc_pages)
    if score < 1.5:
        tier = "fast"
    elif score < 4.0:
        tier = "standard"
    else:
        tier = "premium"
    return MODELS[tier]

def call_llm(system: str, user: str, max_tokens: int = 200, doc_pages: int = 1) -> str:
    model = route(user, max_tokens, doc_pages)
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

# Example: classification routes to Mistral (fast), complex analysis to Sonnet (premium)
contract_type = call_llm(
    system="Classify this insurance document. Reply with one of: auto, health, liability, property, other.",
    user=contract_text,
    max_tokens=10,
    doc_pages=1,
)  # → Mistral 7B, ~€0

coverage_gaps = call_llm(
    system="Analyze this multi-policy contract. Identify coverage gaps and recommend additions.",
    user=full_contract_text,
    max_tokens=800,
    doc_pages=42,
)  # → Claude Sonnet (premium)

Step 4 — Self-Hosted Inference with Ollama

For the Fast tier (65% of volume), Nexus Finance deployed Ollama on a Hetzner AX102 server (2× RTX 4090, €89/month). Mistral 7B Instruct Q4_K_M runs at 120 tokens/second on this hardware, handling 40 concurrent requests with p95 latency of 820ms — faster than GPT-4 Turbo's API at peak hours.

# docker-compose.yml — Ollama + GPU passthrough
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - OLLAMA_NUM_PARALLEL=8       # concurrent requests per GPU
      - OLLAMA_MAX_LOADED_MODELS=2  # keep 2 models warm (Mistral + Nomic embed)

volumes:
  ollama_data:

# After docker compose up -d:
# Pull the model (downloads ~4.1GB)
# docker exec ollama ollama pull mistral:7b-instruct-q4_K_M
# docker exec ollama ollama pull nomic-embed-text  # free embeddings (replaces ada-002)

# ---
# Test inference speed:
# curl http://localhost:11434/api/generate -d '{
#   "model": "mistral:7b-instruct-q4_K_M",
#   "prompt": "Classify: auto, health, liability, or property? Document: ...",
#   "stream": false
# }'
# Expected on 2× RTX 4090: ~120 tok/s, first token in 180ms

Hardware sizing guide

  • <50k requests/month: CPU VPS (4 vCPU, 16GB RAM) — Mistral 7B Q4, €20-35/month, ~8 tok/s
  • 50k–500k requests/month: Single RTX 4090 (Hetzner AX61) — €45-65/month, ~80 tok/s
  • >500k requests/month: 2× RTX 4090 (Hetzner AX102) — €89-109/month, ~160 tok/s
  • Break-even vs GPT-4o mini: 100k requests/month with 500-token avg output (worked example below)
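The break-even point depends heavily on input size, so it is worth checking against your own traffic shape. A back-of-envelope sketch (GPT-4o mini prices from the pricing table in Step 1; the €99/month server cost is an assumed midpoint of the AX102 range):

# breakeven_sketch.py — GPT-4o mini API spend vs a flat-rate GPU server
MINI_INPUT, MINI_OUTPUT = 0.15, 0.60   # $/1M tokens
SERVER_EUR_MONTH = 99                  # assumed midpoint of the €89-109 range

def mini_monthly_cost(requests: int, in_tokens: int = 1_000, out_tokens: int = 500) -> float:
    return requests * (in_tokens * MINI_INPUT + out_tokens * MINI_OUTPUT) / 1_000_000

for req in (50_000, 100_000, 500_000):
    print(f"{req:>7,} req/mo: mini ≈ ${mini_monthly_cost(req):>6,.0f} vs server €{SERVER_EUR_MONTH}")

# With short inputs the crossover sits well above 100k requests/month; with
# document-sized inputs (3-4k tokens, as in this case study) it lands near 100k:
print(f"100k doc-heavy req/mo: mini ≈ ${mini_monthly_cost(100_000, in_tokens=3_800):,.0f}")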

Step 5 — Batch Processing for Non-Real-Time Documents

Nexus Finance discovered that 35% of their document processing was triggered by overnight sync jobs — no human waiting for results. These could use the OpenAI Batch API (50% discount) or Claude's batch endpoint.

# batch_processor.py — OpenAI Batch API (50% cheaper, 24h turnaround)
import json
import time
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def create_batch_job(documents: list[dict], output_file: str) -> str:
    """
    documents: [{"id": "doc_001", "text": "...", "use_case": "email_summary"}]
    Returns batch job ID.
    """
    # Build JSONL file of requests
    requests = []
    for doc in documents:
        requests.append({
            "custom_id": doc["id"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # 50% of already-cheap mini pricing
                "messages": [
                    {"role": "system", "content": "Summarize this insurance email in 2 sentences."},
                    {"role": "user", "content": doc["text"][:4000]},
                ],
                "max_tokens": 150,
            }
        })

    # Write JSONL
    jsonl_path = Path("/tmp/batch_requests.jsonl")
    with open(jsonl_path, "w") as f:
        for req in requests:
            f.write(json.dumps(req) + "\n")

    # Upload file
    with open(jsonl_path, "rb") as f:
        batch_file = client.files.create(file=f, purpose="batch")

    # Create batch job (processed within 24h, 50% discount)
    batch_job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    print(f"Batch job created: {batch_job.id}")
    return batch_job.id

def retrieve_batch_results(batch_id: str) -> list[dict]:
    """Poll until complete, then parse results."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            break
        if batch.status in {"failed", "expired", "cancelled"}:
            raise RuntimeError(f"Batch failed: {batch.status}")
        print(f"Status: {batch.status} — waiting 60s...")
        time.sleep(60)

    # Download results
    result_file = client.files.content(batch.output_file_id)
    results = []
    for line in result_file.text.splitlines():
        item = json.loads(line)
        results.append({
            "id": item["custom_id"],
            "response": item["response"]["body"]["choices"][0]["message"]["content"],
        })
    return results

# Cron job: run nightly for previous day's non-urgent documents
# Cost: €0.075/1M input (vs €0.15 standard), €0.30/1M output (vs €0.60)
# Nexus Finance: spend on batch-eligible jobs fell from €85/month to €42/month

Step 6 — Prompt Compression (Quick Win)

The team's original system prompts were verbose — written by developers who added clarifying sentences “just in case.” A careful audit revealed 38% of input tokens were filler. Removing redundant instructions, collapsing examples into structured format, and trimming preamble saved €420/month with zero quality impact.

Before (847 tokens)

You are an expert insurance document analysis assistant with deep knowledge of European insurance law, including French, Belgian, and Swiss regulatory frameworks. Your task is to carefully review the insurance contract provided by the user and extract the following information in a structured format. Please be thorough and accurate in your analysis. If you are unsure about any field, indicate this clearly. The output should be in JSON format...

After (168 tokens)

Extract from this insurance contract. Output JSON only, no prose. Fields: policy_type, coverage_limit_eur, deductible_eur, exclusions (array), renewal_date (ISO), insurer_name. If a field is absent, use null.

A 5-minute automated compression check using tiktoken catches prompt bloat before it ships to production:

# prompt_audit.py — flag verbose prompts in CI
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer (rough proxy for Claude counts)

PROMPT_BUDGET = {
    "classification": 80,    # tokens
    "extraction": 200,
    "analysis": 500,
    "summarization": 150,
}

def check_prompt_size(system_prompt: str, use_case: str) -> None:
    tokens = len(enc.encode(system_prompt))
    budget = PROMPT_BUDGET.get(use_case, 300)
    if tokens > budget:
        raise ValueError(
            f"System prompt for '{use_case}' uses {tokens} tokens "
            f"(budget: {budget}). Compress before deploying.\n"
            f"Estimated waste at 100k calls/day (GPT-4 Turbo input pricing): "
            f"${(tokens - budget) * 100_000 * 30 * 10 / 1_000_000:,.0f}/month"
        )
    print(f"OK {use_case}: {tokens} tokens (budget: {budget})")

Results: Before vs After

Technique                      | Savings/month           | Quality impact        | Latency impact
Semantic caching (Redis)       | €1,890                  | None (same model)     | -98% on cache hits
Model tiering (Mistral Fast)   | €1,200                  | -3% on classification | +200ms vs GPT-4 API
Prompt compression             | €420                    | None                  | Faster (fewer tokens)
Batch processing (non-RT)      | €310                    | None (same model)     | +~12h on batch queue
Self-hosted embeddings (Nomic) | €240                    | -1% retrieval recall  | -30ms vs ada-002
Dedup detection                | €380                    | None                  | None
New infrastructure cost        | -€125 (Hetzner + Redis) |                       |
Net monthly savings            | €3,785/month            |                       |

  • Before: €4,200/mo
  • After: €415/mo
  • Reduction: 90.1%
  • Engineering ROI: 2.8 months
  • Quality delta: −2.1%

The 2.1% quality delta comes entirely from the Fast tier (Mistral 7B for classification). Human evaluators — three insurance domain experts reviewing 500 randomly sampled outputs — rated Mistral classification quality at 94.8% parity with GPT-4 Turbo. For extraction and analysis tasks (Standard and Premium tiers), no measurable quality difference was detected.
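That kind of parity check is worth automating before any tier switch: run a random production sample through both models and measure agreement before humans review the disagreements. A minimal sketch, where `classify_fast` and `classify_premium` are placeholders for your Fast- and Premium-tier calls:

# tier_parity.py — sketch: measure Fast-tier agreement with the Premium tier
import random

def parity(docs: list[str], classify_fast, classify_premium) -> float:
    """Fraction of sampled documents where both tiers return the same label."""
    agree = sum(classify_fast(d) == classify_premium(d) for d in docs)
    return agree / len(docs)

# sample = random.sample(all_documents, 500)  # all_documents: your corpus
# print(f"Fast-tier parity: {parity(sample, mistral_classify, gpt4_classify):.1%}")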

Download: AI Cost Optimization Calculator

Use the spreadsheet below to model your own current spend, projected savings from each technique, and 12-month ROI. It includes tabs for spend audit, hardware sizing, caching ROI, and batch processing scenarios.


AI Cost Optimization Calculator

CSV spreadsheet · Opens in Excel, Google Sheets, LibreOffice · 5 tabs: Spend Audit, Savings Model, Hardware Sizing, Caching ROI, 12-Month Projection

Download Calculator (CSV)

FAQ

Is 90% cost reduction realistic without sacrificing quality?

Yes, for workloads with diverse complexity levels. The key insight is that 60-70% of most enterprise LLM workloads involve routine classification or extraction tasks that Mistral 7B handles at 93% quality parity with GPT-4 Turbo — at 1/40th the cost. Quality degradation only becomes meaningful for open-ended reasoning tasks, which typically represent 10-15% of production volume. Tier your queries by complexity, use the right model for each tier, and measure quality at each tier before switching.

How much engineering time does this optimization require?

Nexus Finance's lead engineer spent approximately 3 weeks of focused work: 1 week auditing actual token usage and building the cost model, 1 week implementing LiteLLM routing + Redis semantic cache, 1 week setting up Ollama on a Hetzner GPU server and running A/B quality tests. The ongoing maintenance is under 2 hours/week. Total investment: ~120 engineering hours. At a €100/hour rate, that is €12,000 — recovered in 3 months from savings.

What hardware is needed for Ollama self-hosting to make financial sense?

For 500,000+ requests/month, a Hetzner AX102 (2× RTX 4090, €89-109/month) is the proven choice. It runs Mistral 7B Q4 at 120 tokens/second, handles 40 concurrent requests, and delivers p95 latency of 820ms — faster than GPT-4 Turbo's API. For smaller volumes (under 50,000 requests/month), a CPU-only VPS running Mistral 7B Q4 (€20-35/month) is enough. The break-even vs GPT-4o mini is typically at 100,000+ requests/month.

Does semantic caching work for financial/legal documents where every query is unique?

Semantic caching works on query similarity, not exact matches. In Nexus Finance's case, 45% of queries were semantically similar — insurance brokers asking variations of the same questions about coverage limits, exclusion clauses, and premium calculations. Using Redis Stack with cosine similarity threshold of 0.92, those queries hit cache. For truly unique queries (complex contract analysis), the cache miss falls through to the appropriate model tier. The 45% cache hit rate is conservative; document-heavy pipelines with FAQ patterns often reach 60-70%.

What is LiteLLM and why use it instead of calling APIs directly?

LiteLLM is an open-source proxy that presents a unified OpenAI-compatible API surface for 100+ LLM providers and self-hosted models. Instead of maintaining separate SDK integrations for Anthropic, OpenAI, and Ollama, you call one endpoint and LiteLLM routes to the right backend. For cost optimization, it adds per-model spend tracking, automatic fallbacks, and routing rules (e.g., route to Ollama when latency budget is <500ms, fall back to Claude Haiku if Ollama is overloaded). Setup takes under an hour with Docker.
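As an illustration of the fallback behavior, LiteLLM's Python `Router` can express it in a few lines (a sketch reusing the tier names from Step 3; verify the exact config keys against the LiteLLM docs for your installed version):

# router_fallbacks.py — sketch: Ollama-first routing with a Haiku fallback
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "fast",
         "litellm_params": {"model": "ollama/mistral:7b-instruct-q4_K_M"}},
        {"model_name": "standard",
         "litellm_params": {"model": "anthropic/claude-haiku-4-5-20251001"}},
    ],
    fallbacks=[{"fast": ["standard"]}],  # if Ollama errors or is overloaded, retry on Haiku
)

response = router.completion(
    model="fast",
    messages=[{"role": "user", "content": "Classify: auto, health, or liability?"}],
)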

Can I download a spreadsheet to model my own AI costs?

Yes — the AI Cost Optimization Calculator is available at the bottom of this article. It includes tabs for: current spend audit (by model + use case), projected savings from each technique, hardware sizing for self-hosting, and a 12-month ROI projection. It is a CSV file that opens in Excel and Google Sheets.

Apply these techniques to your own infrastructure

Our AI Infrastructure Optimization training covers LiteLLM setup, Ollama deployment, semantic caching, and cost modeling — with your actual workload data.

View Training Catalog →
