Talki Academy
Technical · 22 min read

Ollama + Open WebUI: Deploy Open-Source LLMs in Production (2026)

Complete technical guide to deploying Llama 3.3, Mistral, and CodeLlama in production with Ollama and Open WebUI. Step-by-step installation, performance benchmarks, cost analysis ($0 tokens vs $1500/month API), Python/REST integrations, production best practices.

By Talki Academy · Updated April 2, 2026

In 2026, deploying LLMs in production without relying on proprietary APIs has become a priority for many companies. Between exploding API costs at scale ($500-5000/month for intensive use), network latency issues, and data confidentiality constraints, self-hosted open-source models represent a credible alternative.

Ollama drastically simplifies local LLM deployment: a single command to install Llama 3.3 70B, Mistral Large, CodeLlama, or DeepSeek. Open WebUI provides a ChatGPT-like interface running on your infrastructure. Together, they enable moving from $1500/month in API calls to $80/month for cloud servers — without sacrificing quality for 80% of use cases.

Why Ollama + Open-Source LLMs in 2026?

Cost Analysis: Proprietary APIs vs Self-Hosting

Real case: SaaS startup with 500 active users generating 1M tokens/day (content generation, support chatbot, summaries). Let's compare costs over 12 months.

| Solution | Infra/month | Tokens/month | Total/month | Total/year |
|---|---|---|---|---|
| OpenAI GPT-4 Turbo | $0 | $3000 (30M tokens) | $3000 | $36,000 |
| Claude Sonnet 4.5 | $0 | $900 (30M tokens) | $900 | $10,800 |
| Ollama + Llama 3.3 70B (GPU cloud) | $180 (L4 24GB) | $0 | $180 | $2,160 |
| Ollama + Llama 3.3 70B (dedicated server) | $89 (Hetzner GPU) | $0 | $89 | $1,068 |
| Ollama + Mistral 7B (CPU only) | $29 (VPS 16 vCPU) | $0 | $29 | $348 |

Achievable savings:

  • -94% cost reduction moving from Claude API to Ollama + Llama 3.3 on dedicated server ($10,800 → $1,068/year)
  • -99% cost reduction moving from GPT-4 to Ollama + Mistral 7B on VPS ($36,000 → $348/year)
  • Immediate ROI: payback in 1 month for volume > 100k tokens/day
  • Linear scalability: 10x more users = +1 GPU server ($180/month), not +$900/month in API calls
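The payback claim above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch using the illustrative prices from the table (a blended per-million-token API rate and a flat server fee); your own rates will differ:

```python
# Break-even estimate: self-hosted server rent vs per-token API pricing.
# Prices are the illustrative figures from the cost table above.

def monthly_api_cost(tokens_per_day: int, usd_per_million: float) -> float:
    """API bill for a 30-day month at a blended per-million-token rate."""
    return tokens_per_day * 30 / 1_000_000 * usd_per_million

def payback_days(server_usd_per_month: float, tokens_per_day: int,
                 usd_per_million: float) -> float:
    """Days of avoided API spend needed to cover one month of server rent."""
    daily_api = tokens_per_day / 1_000_000 * usd_per_million
    return server_usd_per_month / daily_api

# 1M tokens/day at Claude Sonnet-like pricing (~$30/1M blended)
# vs the $89/month Hetzner GPU server from the table
api = monthly_api_cost(1_000_000, 30.0)      # $900/month in API calls
days = payback_days(89.0, 1_000_000, 30.0)   # ~3 days of API spend
print(f"API: ${api:.0f}/month, server pays for itself in {days:.1f} days")
```

At 100k tokens/day the same formula gives roughly 30 days, which is where the "payback in 1 month for volume > 100k tokens/day" rule of thumb comes from.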

Ideal Use Cases for Ollama

| Use Case | Recommended Model | Rationale |
|---|---|---|
| Internal customer support chatbot | Llama 3.3 8B | Sensitive data, no critical latency, high volume |
| Technical documentation generation | CodeLlama 34B | Code-specialized, quality > latency, offline OK |
| Automatic meeting summaries | Mistral 7B | Simple task, very high volume, cost critical |
| Code assistant in IDE | DeepSeek Coder 33B | Best code quality, must be local (latency) |
| Contract analysis (confidential data) | Llama 3.3 70B | Strict GDPR, ultra-sensitive data, max quality |
| Support ticket classification | Mistral 7B (quantized Q4) | Simple task, <500ms latency required |

Ollama Installation: macOS, Linux, Docker

macOS Installation (Apple Silicon M1/M2/M3)

Ollama leverages the integrated GPU of Apple Silicon chips via Metal. A Mac M3 Max 128GB can run Llama 3.3 70B at ~15 tokens/s.

```bash
# One-command installation
curl -fsSL https://ollama.com/install.sh | sh

# Check installation
ollama --version
# ollama version 0.3.14

# Start server (runs in the background automatically)
ollama serve

# Download and run Llama 3.3 70B
ollama run llama3.3:70b
# First run: model download (~40GB)
# Then an interactive conversation starts:
>>> Explain RAG in 3 simple sentences.
# Quick performance test
>>> /bye
# Exits the conversation

# List downloaded models
ollama list
# NAME            SIZE    MODIFIED
# llama3.3:70b    40GB    2 minutes ago
```

Expected result: On Mac M3 Max, first response in ~8s, subsequent tokens at ~15 tok/s. RAM usage: ~50GB for 70B model.

Linux Installation (Ubuntu/Debian)

```bash
# Installation
curl -fsSL https://ollama.com/install.sh | sh

# If you have an NVIDIA GPU, install the CUDA drivers,
# then check GPU detection:
nvidia-smi

# Start Ollama
ollama serve

# In another terminal: download multiple models
ollama pull llama3.3:70b   # 40GB  - best quality
ollama pull llama3.3:8b    # 4.7GB - faster
ollama pull mistral:7b     # 4.1GB - excellent multilingual
ollama pull codellama:34b  # 19GB  - code specialized

# Comparative latency test
time ollama run llama3.3:8b "Summarize Docker in 2 sentences"
# real 0m2.341s (~15 tokens/s on RTX 4090)
time ollama run llama3.3:70b "Summarize Docker in 2 sentences"
# real 0m5.127s (~8 tokens/s on RTX 4090)

# Run Ollama as a systemd service (production)
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl status ollama
```

Docker Installation (Multi-Platform Production)

```yaml
# docker-compose.yml for Ollama + Open WebUI
version: '3.8'

services:
  # Ollama: model server
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama  # Model storage
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]  # Requires nvidia-docker
    restart: unless-stopped

  # Open WebUI: ChatGPT-like interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=false  # Or true with user management
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
```
```bash
# Start services
docker-compose up -d

# Wait for Ollama to be ready (~10s)
sleep 10

# Download models inside the container
docker exec -it ollama ollama pull llama3.3:70b
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull codellama:34b

# Check logs
docker-compose logs -f ollama

# Access Open WebUI: open http://localhost:3000 in a browser
# Interface ready, models available in the dropdown

# GPU monitoring (if NVIDIA)
watch -n 1 nvidia-smi
# Expected: GPU utilization at ~90% during inference
```

Model Comparison: Llama, Mistral, CodeLlama, DeepSeek

| Model | Size | RAM Required | Speed (RTX 4090) | Quality | Use Case |
|---|---|---|---|---|---|
| Llama 3.3 70B | 40GB | 48GB | 8-12 tok/s | ⭐⭐⭐⭐⭐ | General use, max quality, similar to GPT-4 Turbo |
| Llama 3.3 8B | 4.7GB | 8GB | 35-50 tok/s | ⭐⭐⭐⭐ | Critical latency, simple chatbots, CPU viable |
| Mistral 7B | 4.1GB | 8GB | 40-60 tok/s | ⭐⭐⭐⭐ | Excellent multilingual, minimal cost, CPU OK |
| Mistral Large 2 | 123GB | 140GB | 4-6 tok/s | ⭐⭐⭐⭐⭐ | Top-tier multilingual, competes with GPT-4 |
| CodeLlama 34B | 19GB | 24GB | 12-18 tok/s | ⭐⭐⭐⭐ | Code generation, technical documentation |
| DeepSeek Coder 33B | 18GB | 24GB | 14-20 tok/s | ⭐⭐⭐⭐⭐ | Best for Python/JS/TS code, better than CodeLlama |
| Qwen2.5 72B | 41GB | 48GB | 7-11 tok/s | ⭐⭐⭐⭐⭐ | Multilingual (excellent Chinese), math, reasoning |

Quality Benchmarks (MMLU, HumanEval, MT-Bench)

Scores on academic benchmarks (higher is better). MMLU = general knowledge, HumanEval = code generation, MT-Bench = multi-turn conversations.

| Model | MMLU | HumanEval | MT-Bench | API Equivalent |
|---|---|---|---|---|
| Llama 3.3 70B | 82.0% | 69.5% | 8.2/10 | ≈ GPT-4 Turbo, Claude Sonnet 3.5 |
| DeepSeek Coder 33B | 66.4% | 78.6% | 7.1/10 | ≈ GPT-3.5 Turbo (better code) |
| Mistral 7B | 62.5% | 40.2% | 6.8/10 | ≈ GPT-3.5 Turbo |
| Llama 3.3 8B | 68.4% | 62.2% | 7.4/10 | ≈ GPT-3.5 Turbo |
| Reference: GPT-4 Turbo | 86.4% | 67.0% | 8.9/10 | — |
| Reference: Claude Opus 4.5 | 88.7% | 84.9% | 9.0/10 | — |

Benchmark conclusion: Llama 3.3 70B achieves 95% of GPT-4 Turbo quality on most tasks. For 80% of production use cases, it's more than sufficient — especially when it costs $0 in tokens vs $3000/month.
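The "95%" figure can be derived directly from the MMLU column of the table above:

```python
# Sanity check on the "95% of GPT-4 Turbo quality" claim,
# using the MMLU scores from the benchmark table above.
llama_mmlu = 82.0
gpt4_turbo_mmlu = 86.4

relative = llama_mmlu / gpt4_turbo_mmlu
print(f"Llama 3.3 70B reaches {relative:.1%} of GPT-4 Turbo on MMLU")
# → 94.9%, consistent with the ~95% stated above
```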

Integrations: Python, REST API, OpenAI Compatibility

Native Python Integration (ollama-python)

```python
# Installation: pip install ollama

# Example 1: simple chat completion
import ollama

response = ollama.chat(
    model='llama3.3:70b',
    messages=[
        {
            'role': 'system',
            'content': 'You are a technical assistant expert in cloud computing.'
        },
        {
            'role': 'user',
            'content': 'Explain the difference between Kubernetes and Docker Swarm in 3 points.'
        }
    ]
)

print(response['message']['content'])
# Expected output:
# 1. **Complexity**: Kubernetes offers more features (auto-scaling,
#    rolling updates, service mesh) but requires more configuration.
#    Docker Swarm is simpler to start with.
# 2. **Ecosystem**: Kubernetes dominates the industry (CNCF, cloud-native
#    support); Docker Swarm is declining.
# 3. **Scale**: Kubernetes scales to thousands of nodes; Swarm suits
#    clusters of <100 nodes.
```
```python
# Example 2: streaming (token-by-token responses)
import ollama

stream = ollama.chat(
    model='llama3.3:8b',
    messages=[{'role': 'user', 'content': 'Write a haiku about DevOps'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
# Output (progressive):
# Code deployed late
# Pipeline runs endlessly
# Coffee, logs, success

print()  # Final newline
```
```python
# Example 3: code generation with metrics
import ollama
import time

start = time.time()
response = ollama.chat(
    model='codellama:34b',
    messages=[
        {
            'role': 'user',
            'content': """Write a Python function that:
1. Reads a CSV file
2. Filters rows where 'status' column == 'active'
3. Groups by 'category' and counts occurrences
4. Returns a dict {category: count}
Use pandas. Include error handling."""
        }
    ],
    options={
        'temperature': 0.2,  # Less creativity for code
        'top_p': 0.9
    }
)
elapsed = time.time() - start

code = response['message']['content']
print(code)
print(f"\n⏱️ Generated in {elapsed:.2f}s")
print(f"📊 {len(code.split())} words, {response['eval_count']} tokens")

# Expected output:
# import pandas as pd
# from typing import Dict
#
# def count_active_by_category(filepath: str) -> Dict[str, int]:
#     """Count active entries by category from a CSV."""
#     try:
#         df = pd.read_csv(filepath)
#         active_df = df[df['status'] == 'active']
#         return active_df.groupby('category').size().to_dict()
#     except FileNotFoundError:
#         raise ValueError(f"File not found: {filepath}")
#     except KeyError as e:
#         raise ValueError(f"Missing column: {e}")
#
# ⏱️ Generated in 8.3s
# 📊 142 words, 487 tokens
```

REST API: OpenAI Compatibility (drop-in replacement)

```python
# Ollama exposes an OpenAI-compatible API at /v1/chat/completions,
# so you can use the OpenAI SDK directly.
from openai import OpenAI

# Point the client at Ollama instead of OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required by the SDK but not checked by Ollama
)

# Identical code to the OpenAI API
response = client.chat.completions.create(
    model='llama3.3:70b',
    messages=[
        {'role': 'system', 'content': 'You are a web security expert.'},
        {'role': 'user', 'content': 'Explain XSS and give an exploit example + mitigation.'}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

# Migrating from OpenAI to Ollama = changing 2 lines (base_url + model).
# All other code remains identical.
```
```bash
# Example: direct REST call with curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:8b",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.1
  }'

# JSON response (OpenAI format):
# {
#   "id": "chatcmpl-xyz",
#   "object": "chat.completion",
#   "created": 1735689600,
#   "model": "llama3.3:8b",
#   "choices": [
#     {
#       "index": 0,
#       "message": {
#         "role": "assistant",
#         "content": "The capital of France is Paris."
#       },
#       "finish_reason": "stop"
#     }
#   ],
#   "usage": {
#     "prompt_tokens": 18,
#     "completion_tokens": 7,
#     "total_tokens": 25
#   }
# }
```

LangChain Integration (RAG, Agents, Tool Use)

```python
# Installation: pip install langchain langchain-community ollama

# RAG with Ollama: Q&A system over documentation
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Load documentation
loader = TextLoader("docs/kubernetes-guide.txt")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
texts = text_splitter.split_documents(documents)

# 3. Create embeddings (nomic-embed-text is optimized for RAG)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# 4. Index in ChromaDB
vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 5. Create the RAG chain
llm = Ollama(model="llama3.3:70b", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# 6. Ask questions
result = qa_chain({"query": "How to configure auto-scaling in Kubernetes?"})
print(result['result'])
print(f"\nSources: {len(result['source_documents'])} documents used")

# Output:
# To configure auto-scaling in Kubernetes, use a
# HorizontalPodAutoscaler (HPA). Define target metrics
# (CPU, memory or custom metrics) and min/max replica limits.
# Example: kubectl autoscale deployment nginx --cpu-percent=50 --min=2 --max=10
#
# Sources: 3 documents used
```

Production Deployment: Docker Compose, GPU, Load Balancing

Recommended Production Architecture

```yaml
# docker-compose.production.yml
version: '3.8'

services:
  # NGINX: load balancer distributing across multiple Ollama workers
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - ollama-worker-1
      - ollama-worker-2
    restart: unless-stopped

  # Ollama Worker 1 (GPU 0)
  ollama-worker-1:
    image: ollama/ollama:latest
    container_name: ollama-worker-1
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - CUDA_VISIBLE_DEVICES=0  # GPU 0
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  # Ollama Worker 2 (GPU 1)
  ollama-worker-2:
    image: ollama/ollama:latest
    container_name: ollama-worker-2
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - CUDA_VISIBLE_DEVICES=1  # GPU 1
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  # Open WebUI: user interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://nginx:80
      - WEBUI_AUTH=true
      - WEBUI_JWT_SECRET_KEY=${JWT_SECRET}
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - nginx
    restart: unless-stopped

  # Prometheus: metrics monitoring
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped

  # Grafana: dashboards
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  ollama_models:
  open_webui_data:
  prometheus_data:
  grafana_data:
```
```nginx
# nginx.conf: least-connections load balancing between workers
upstream ollama_backend {
    least_conn;  # Send each request to the least-loaded worker
    server ollama-worker-1:11434 max_fails=3 fail_timeout=30s;
    server ollama-worker-2:11434 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://ollama_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # High timeouts for LLMs (generation can take 30s+)
        proxy_connect_timeout 60s;
        proxy_send_timeout 120s;
        proxy_read_timeout 120s;

        # Streaming support
        proxy_buffering off;
        proxy_cache off;
    }

    # Healthcheck endpoint
    location /health {
        access_log off;
        return 200 "OK\n";
        add_header Content-Type text/plain;
    }
}
```

Real Case: Startup Reducing API Costs by 80%

Context: TechDoc SaaS, technical documentation generation platform for developers. 2000 active users, 2.5M tokens/day input + output. Used GPT-4 Turbo for 12 months.

Problems encountered:

  • OpenAI API cost: $4200/month (75M tokens × ~$56/1M tokens blended)
  • Network latency: 2-5s RTT to OpenAI API (servers in EU)
  • Rate limits: blocks at 500 req/min during traffic peaks
  • GDPR concerns: user data (proprietary code) sent to OpenAI US

Deployed solution:

  • Migration to Ollama + Llama 3.3 70B (Q8 quantization)
  • Infra: Hetzner AX102 server (2× RTX 4090, 128GB RAM, $89/month) + Load balancer ($20/month)
  • Migration time: 3 days (1 day infra config, 2 days quality tests)
  • Code changes: 8 lines modified (changed the base_url in the OpenAI SDK client)

Results after 6 months:

| Metric | Before (GPT-4 API) | After (Ollama) | Change |
|---|---|---|---|
| Monthly cost | $4200 | $109 (server + backup) | -97% ✅ |
| Latency p50 | 3.2s | 1.8s | -44% ✅ |
| Latency p99 | 12s (rate limits) | 4.1s | -66% ✅ |
| Output quality (human eval) | 92% | 89% | -3% ⚠️ |
| Availability | 99.7% (OpenAI SLA) | 99.95% (self-hosted) | +0.25% ✅ |
| Rate limit incidents | 12-15/month | 0 | -100% ✅ |

CTO feedback:

"The migration to Ollama was surprisingly simple. We saved $25,000 over 6 months while improving latency and eliminating rate limits. The slight quality drop (89% vs 92%) is imperceptible to our users — we measured via A/B test and identical NPS. For 80% of our use cases, Llama 3.3 is indistinguishable from GPT-4. We keep GPT-4 API only for 2-3% of ultra-complex requests (via automatic fallback). ROI: migration investment recovered in 2 weeks."
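The "automatic fallback" the CTO mentions can be sketched as a thin routing layer in front of two OpenAI-compatible endpoints. The `is_complex` heuristic and the model names below are illustrative assumptions, not TechDoc's actual code:

```python
# Sketch of an Ollama-first router with a paid-API fallback for the small
# share of hard requests. The is_complex() heuristic is an assumption;
# real systems often use a classifier or confidence score instead.

def is_complex(prompt: str) -> bool:
    """Crude heuristic: very long or explicitly multi-step prompts
    go to the paid API."""
    return len(prompt) > 4000 or "step by step" in prompt.lower()

def route(prompt: str) -> tuple[str, str]:
    """Return (base_url, model) for a request.
    Both endpoints speak the OpenAI chat-completions protocol."""
    if is_complex(prompt):
        return ("https://api.openai.com/v1", "gpt-4-turbo")  # ~2-3% of traffic
    return ("http://localhost:11434/v1", "llama3.3:70b")     # default: $0 tokens

print(route("Summarize this README"))
# → ('http://localhost:11434/v1', 'llama3.3:70b')
```

Because Ollama mirrors the OpenAI API, a single `OpenAI(base_url=..., api_key=...)` client construction handles both paths with otherwise identical request code.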

Production Best Practices

Monitoring and Alerts

```yaml
# prometheus.yml: scraping Ollama metrics
global:
  scrape_interval: 15s

scrape_configs:
  # GPU metrics via NVIDIA DCGM
  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['dcgm-exporter:9400']

  # System metrics (node_exporter)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Custom Ollama metrics (via wrapper)
  - job_name: 'ollama'
    static_configs:
      - targets: ['ollama-exporter:8000']
```

```yaml
# alerts.yml: critical alerts (Alertmanager)
groups:
  - name: ollama
    rules:
      # GPU temperature > 85°C
      - alert: GPUOverheating
        expr: nvidia_gpu_temperature_celsius > 85
        for: 5m
        annotations:
          summary: "GPU overheating on {{ $labels.instance }}"

      # VRAM utilization > 95%
      - alert: VRAMSaturation
        expr: nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes > 0.95
        for: 2m
        annotations:
          summary: "VRAM near saturation on GPU {{ $labels.gpu }}"

      # Latency p95 > 10s
      - alert: HighLatency
        expr: histogram_quantile(0.95, ollama_request_duration_seconds) > 10
        for: 5m
        annotations:
          summary: "High latency detected (p95 > 10s)"

      # Error rate > 5%
      - alert: HighErrorRate
        expr: rate(ollama_requests_failed_total[5m]) / rate(ollama_requests_total[5m]) > 0.05
        annotations:
          summary: "Error rate above 5%"
```

Troubleshooting: Common Issues and Solutions

| Symptom | Probable Cause | Solution |
|---|---|---|
| Very slow generation (>30s) | Model too large for available RAM/VRAM, swap used | Use a quantized version (Q4) or a smaller model (8B instead of 70B) |
| "out of memory" error | Insufficient GPU VRAM | Switch to Q4 quantization or upgrade GPU (minimum 24GB for 70B) |
| GPU not detected (CPU fallback) | Missing NVIDIA drivers or nvidia-docker not installed | Install CUDA toolkit + nvidia-docker, check nvidia-smi |
| Lower quality than expected | Temperature too high (excessive creativity) or unsuitable model | Lower temperature (0.1-0.3 for factual tasks), try another model |
| Latency increases after 1h of use | GPU thermal throttling (>85°C) | Improve cooling, reduce load (fewer concurrent workers) |
| First request takes 30-60s | Cold start: loading model into VRAM | Increase OLLAMA_KEEP_ALIVE (keep model loaded), or preload at startup |
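For the memory-related rows, a rough capacity check before pulling a model saves a lot of trial and error. The rule of thumb below (weights ≈ parameters × bits/8, plus ~20% overhead for KV cache and activations) is an approximation; actual usage varies with context length and runtime:

```python
# Rule-of-thumb VRAM estimate for a quantized model.
# The 20% overhead factor is an assumption, not a measured constant.

def vram_gb(params_billion: float, bits_per_weight: int,
            overhead: float = 1.2) -> float:
    """Approximate VRAM (GB) needed: weights plus KV-cache/activation overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead

for name, params, bits in [("llama3.3:70b Q4", 70, 4),
                           ("llama3.3:70b Q8", 70, 8),
                           ("mistral:7b Q4", 7, 4)]:
    print(f"{name}: ~{vram_gb(params, bits):.0f} GB")
```

This matches the troubleshooting advice: a 70B model at Q8 needs far more than a single 24GB card, while Q4 brings it within reach of a 48GB setup and a quantized 7B fits comfortably on CPU-class hardware.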

Resources and Training

To master deploying open-source LLMs in production and integrating Ollama into your applications, our Claude API for Developers training also covers open-source alternatives (Llama, Mistral), hybrid architectures (API + self-hosted), and migration strategies. 2-day training, OPCO eligible.

We also offer a specialized "Self-Hosted LLMs in Production" module (1 day) on Ollama, vLLM, and GPU optimizations. Contact us via the contact form.

Frequently Asked Questions

Is Ollama really free for commercial use?

Yes. Ollama is open-source (MIT license) and can be used commercially without restrictions. Models (Llama 3.3, Mistral, etc.) have permissive licenses allowing commercial use. Only constraint: you pay for infrastructure (GPU/CPU server). Typical cost: $50-200/month depending on volume vs $500-5000/month for equivalent proprietary APIs.

What's the difference between Ollama and an API like OpenAI/Claude?

Ollama runs models locally (on your machine or server), proprietary APIs are hosted by the provider. Ollama advantages: zero cost per token, 100% private data, no rate limits, works offline. Disadvantages: requires GPU infrastructure for optimal performance, quality inferior to best proprietary models (GPT-4, Claude Opus) on complex tasks.

Which models to choose for production in 2026?

For general use: Llama 3.3 70B (best quality/performance ratio, similar to GPT-4 Turbo). For critical latency: Llama 3.3 8B or Mistral 7B (responses <1s on CPU). For code: CodeLlama 34B or DeepSeek Coder 33B. For multilingual: Mistral Large 2 or Qwen2.5. Rule: use the smallest model that meets your quality criteria.

Can you deploy Ollama without GPU?

Yes, Ollama works on CPU but it's 10-50x slower. For CPU-only production: limit to 7B-13B quantized models (Q4_K_M) and accept 5-15s latency per response. For serious production: GPU required. Minimum viable: RTX 4090 24GB ($500 used) or NVIDIA L4 cloud ($0.50/h). For scale: A100 40GB ($2-3/h) or H100 ($4-6/h).

How to migrate from OpenAI API to Ollama without rewriting code?

Ollama is compatible with the OpenAI API. Only change the base URL (http://localhost:11434/v1) and the model name (llama3.3:70b). Your `client.chat.completions.create()` calls work as-is. Main difference: function calling support is limited (use prompt engineering or LangChain for tool use). Migration = a handful of lines of code.

Train Your Team in AI

Our training programs are OPCO-eligible — potential out-of-pocket cost: $0.

View Training Programs · Check OPCO Eligibility