In 2026, deploying LLMs in production without relying on proprietary APIs has become a priority for many companies. Between exploding API costs at scale ($500-5000/month for intensive use), network latency issues, and data confidentiality constraints, self-hosted open-source models represent a credible alternative.
Ollama drastically simplifies local LLM deployment: a single command to install Llama 3.3 70B, Mistral Large, CodeLlama, or DeepSeek. Open WebUI provides a ChatGPT-like interface running on your infrastructure. Together, they enable moving from $1500/month in API calls to $80/month for cloud servers — without sacrificing quality for 80% of use cases.
Why Ollama + Open-Source LLMs in 2026?
Cost Analysis: Proprietary APIs vs Self-Hosting
Real case: SaaS startup with 500 active users generating 1M tokens/day (content generation, support chatbot, summaries). Let's compare costs over 12 months.
| Solution | Infra/month | Tokens/month | Total/month | Total/year |
|---|---|---|---|---|
| OpenAI GPT-4 Turbo | $0 | $3000 (30M tokens) | $3000 | $36,000 |
| Claude Sonnet 4.5 | $0 | $900 (30M tokens) | $900 | $10,800 |
| Ollama + Llama 3.3 70B (GPU cloud) | $180 (L4 24GB) | $0 | $180 | $2,160 |
| Ollama + Llama 3.3 70B (dedicated server) | $89 (Hetzner GPU) | $0 | $89 | $1,068 |
| Ollama + Mistral 7B (CPU only) | $29 (VPS 16 vCPU) | $0 | $29 | $348 |
Achievable savings:
- 90% cost reduction moving from the Claude API to Ollama + Llama 3.3 on a dedicated server ($10,800 → $1,068/year)
- 99% cost reduction moving from GPT-4 to Ollama + Mistral 7B on a VPS ($36,000 → $348/year)
- Fast ROI: payback within the first month for volumes above 100k tokens/day
- Linear scalability: 10x more users = one extra GPU server (+$180/month), not +$900/month in API calls
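The break-even arithmetic behind these numbers is easy to reproduce. A minimal sketch — the $30/1M blended API price and the $89/month server bill are illustrative values taken from the scenario above, not quotes:

```python
# Back-of-the-envelope break-even: hosted API vs self-hosted server.
# api_cost_per_m is a blended $/1M-token price, server_monthly the fixed bill.

def monthly_costs(tokens_per_day: int, api_cost_per_m: float, server_monthly: float):
    """Return (api_monthly, selfhosted_monthly) for a 30-day month."""
    tokens_per_month = tokens_per_day * 30
    api_monthly = tokens_per_month / 1_000_000 * api_cost_per_m
    return api_monthly, server_monthly

def break_even_tokens_per_day(api_cost_per_m: float, server_monthly: float) -> int:
    """Daily token volume above which self-hosting is cheaper."""
    return int(server_monthly / 30 / api_cost_per_m * 1_000_000)

# The startup scenario from the table: 1M tokens/day vs a $89/month GPU server
api, hosted = monthly_costs(1_000_000, 30.0, 89.0)
print(f"API: ${api:.0f}/month, self-hosted: ${hosted:.0f}/month")
# API: $900/month, self-hosted: $89/month
print(f"Break-even: {break_even_tokens_per_day(30.0, 89.0):,} tokens/day")
# Break-even: 98,888 tokens/day
```

That ~99k tokens/day figure is where the "payback within the first month above 100k tokens/day" rule of thumb comes from.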
Ideal Use Cases for Ollama
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Internal customer support chatbot | Llama 3.1 8B | Sensitive data, no critical latency, high volume |
| Technical documentation generation | CodeLlama 34B | Code-specialized, quality > latency, offline OK |
| Automatic meeting summaries | Mistral 7B | Simple task, very high volume, cost critical |
| Code assistant in IDE | DeepSeek Coder 33B | Best code quality, must be local (latency) |
| Contract analysis (confidential data) | Llama 3.3 70B | Strict GDPR, ultra-sensitive data, max quality |
| Support ticket classification | Mistral 7B (quantized Q4) | Simple task, <500ms latency required |
Ollama Installation: macOS, Linux, Docker
macOS Installation (Apple Silicon M1/M2/M3)
Ollama leverages the integrated GPU of Apple Silicon chips via Metal. A Mac M3 Max 128GB can run Llama 3.3 70B at ~15 tokens/s.
# Install via Homebrew (or download the desktop app from ollama.com/download —
# the curl install script below is Linux-only)
brew install ollama
# Check installation
ollama --version
# ollama version 0.3.14
# Start server (automatically runs in background)
ollama serve
# Download and run Llama 3.3 70B
ollama run llama3.3:70b
# First run: model download (~40GB)
# Then: interactive conversation starts
>>> Explain RAG in 3 simple sentences.
# Quick performance test
>>> /bye # Exit conversation
# List downloaded models
ollama list
# NAME SIZE MODIFIED
# llama3.3:70b 40GB 2 minutes ago
Expected result: On Mac M3 Max, first response in ~8s, subsequent tokens at ~15 tok/s. RAM usage: ~50GB for 70B model.
Linux Installation (Ubuntu/Debian)
# Installation
curl -fsSL https://ollama.com/install.sh | sh
# If you have an NVIDIA GPU, install the CUDA drivers first
# Check GPU detection
nvidia-smi
# Start Ollama
ollama serve
# In another terminal: download multiple models
ollama pull llama3.3:70b # 40GB - best quality
ollama pull llama3.1:8b   # 4.7GB - faster (Llama 3.3 has no 8B variant)
ollama pull mistral:7b # 4.1GB - excellent multilingual
ollama pull codellama:34b # 19GB - code specialized
# Comparative latency test
time ollama run llama3.1:8b "Summarize Docker in 2 sentences"
# real 0m2.341s (~40 tokens/s on RTX 4090)
time ollama run llama3.3:70b "Summarize Docker in 2 sentences"
# real 0m5.127s (~8 tokens/s on RTX 4090)
# Run Ollama as systemd service (production)
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl status ollama
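With the service running, a quick scripted health check confirms the API answers and lists installed models. A standard-library sketch; `/api/tags` is Ollama's model-list endpoint, and the sample payload below is a hand-written stand-in for a real response:

```python
import json
import urllib.request

def list_models(tags_json: str) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    payload = json.loads(tags_json)
    return [m["name"] for m in payload.get("models", [])]

def check_server(base_url: str = "http://localhost:11434") -> list[str]:
    """Query a running Ollama server for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return list_models(resp.read().decode())

# Offline demo with a sample payload (same shape as the real endpoint)
sample = '{"models": [{"name": "llama3.3:70b"}, {"name": "mistral:7b"}]}'
print(list_models(sample))  # ['llama3.3:70b', 'mistral:7b']

# Against a live server:
# print(check_server())
```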
Docker Installation (Multi-Platform Production)
# docker-compose.yml for Ollama + Open WebUI
version: '3.8'

services:
  # Ollama: model server
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama  # Model storage
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]  # Requires nvidia-docker
    restart: unless-stopped

  # Open WebUI: ChatGPT-like interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=false  # Or true with user management
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
# Start services
docker-compose up -d
# Wait for Ollama to be ready (~10s)
sleep 10
# Download models in container
docker exec -it ollama ollama pull llama3.3:70b
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull codellama:34b
# Check logs
docker-compose logs -f ollama
# Access Open WebUI
# Open http://localhost:3000 in browser
# Interface ready, models available in dropdown
# GPU monitoring (if NVIDIA)
watch -n 1 nvidia-smi
# Expected: GPU utilization at ~90% during inference
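To monitor headlessly instead of watching the interactive view, `nvidia-smi` has a CSV query mode that is easy to parse. A small sketch; the sample line is hand-written in the shape that query prints:

```python
def parse_gpu_csv(line: str) -> dict:
    """Parse one line of nvidia-smi CSV output (utilization %, memory MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return {"utilization_pct": int(util), "memory_used_mib": int(mem)}

# Sample line from:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits
sample = "91, 21504"
print(parse_gpu_csv(sample))
# {'utilization_pct': 91, 'memory_used_mib': 21504}

# In a real loop you would run that command via subprocess every few seconds and
# alert when utilization stays near zero during inference (CPU fallback?) or
# memory sits at the card's limit (risk of OOM).
```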
Model Comparison: Llama, Mistral, CodeLlama, DeepSeek
| Model | Size | RAM Required | Speed (RTX 4090) | Quality | Use Case |
|---|---|---|---|---|---|
| Llama 3.3 70B | 40GB | 48GB | 8-12 tok/s | ⭐⭐⭐⭐⭐ | General use, max quality, similar to GPT-4 Turbo |
| Llama 3.1 8B | 4.7GB | 8GB | 35-50 tok/s | ⭐⭐⭐⭐ | Critical latency, simple chatbots, CPU viable |
| Mistral 7B | 4.1GB | 8GB | 40-60 tok/s | ⭐⭐⭐⭐ | Excellent multilingual, minimal cost, CPU OK |
| Mistral Large 2 | 123GB | 140GB | 4-6 tok/s | ⭐⭐⭐⭐⭐ | Top-tier multilingual, competes with GPT-4 |
| CodeLlama 34B | 19GB | 24GB | 12-18 tok/s | ⭐⭐⭐⭐ | Code generation, technical documentation |
| DeepSeek Coder 33B | 18GB | 24GB | 14-20 tok/s | ⭐⭐⭐⭐⭐ | Best for Python/JS/TS code, better than CodeLlama |
| Qwen2.5 72B | 41GB | 48GB | 7-11 tok/s | ⭐⭐⭐⭐⭐ | Multilingual (excellent Chinese), math, reasoning |
Quality Benchmarks (MMLU, HumanEval, MT-Bench)
Scores on academic benchmarks (higher is better). MMLU = general knowledge, HumanEval = code generation, MT-Bench = multi-turn conversations.
| Model | MMLU | HumanEval | MT-Bench | API Equivalent |
|---|---|---|---|---|
| Llama 3.3 70B | 82.0% | 69.5% | 8.2/10 | ≈ GPT-4 Turbo, Claude Sonnet 3.5 |
| DeepSeek Coder 33B | 66.4% | 78.6% | 7.1/10 | ≈ GPT-3.5 Turbo (better code) |
| Mistral 7B | 62.5% | 40.2% | 6.8/10 | ≈ GPT-3.5 Turbo |
| Llama 3.1 8B | 68.4% | 62.2% | 7.4/10 | ≈ GPT-3.5 Turbo |
| Reference: GPT-4 Turbo | 86.4% | 67.0% | 8.9/10 | — |
| Reference: Claude Opus 4.5 | 88.7% | 84.9% | 9.0/10 | — |
Benchmark conclusion: Llama 3.3 70B achieves 95% of GPT-4 Turbo quality on most tasks. For 80% of production use cases, it's more than sufficient — especially when it costs $0 in tokens vs $3000/month.
Integrations: Python, REST API, OpenAI Compatibility
Native Python Integration (ollama-python)
# Installation
pip install ollama
# Example 1: Simple chat completion
import ollama

response = ollama.chat(
    model='llama3.3:70b',
    messages=[
        {
            'role': 'system',
            'content': 'You are a technical assistant expert in cloud computing.'
        },
        {
            'role': 'user',
            'content': 'Explain the difference between Kubernetes and Docker Swarm in 3 points.'
        }
    ]
)

print(response['message']['content'])
# Expected output:
# 1. **Complexity**: Kubernetes offers more features (auto-scaling,
# rolling updates, service mesh) but requires more configuration.
# Docker Swarm is simpler to start.
# 2. **Ecosystem**: Kubernetes dominates the industry (CNCF, cloud native support),
# Docker Swarm is declining.
# 3. **Scale**: Kubernetes scales to thousands of nodes, Swarm suits
# clusters <100 nodes.
# Example 2: Streaming (token-by-token responses)
import ollama

stream = ollama.chat(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Write a haiku about DevOps'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
# Output (progressive):
# Code deployed late
# Pipeline runs endlessly
# Coffee, logs, success
print() # Final newline
# Example 3: Code generation with metrics
import ollama
import time

start = time.time()
response = ollama.chat(
    model='codellama:34b',
    messages=[
        {
            'role': 'user',
            'content': """Write a Python function that:
1. Reads a CSV file
2. Filters rows where 'status' column == 'active'
3. Groups by 'category' and counts occurrences
4. Returns a dict {category: count}
Use pandas. Include error handling."""
        }
    ],
    options={
        'temperature': 0.2,  # Less creativity for code
        'top_p': 0.9
    }
)
elapsed = time.time() - start

code = response['message']['content']
print(code)
print(f"\n⏱️ Generated in {elapsed:.2f}s")
print(f"📊 {len(code.split())} words, {response['eval_count']} tokens")
# Expected output:
# import pandas as pd
# from typing import Dict
#
# def count_active_by_category(filepath: str) -> Dict[str, int]:
#     """
#     Count active entries by category from CSV.
#     ...
#     """
#     try:
#         df = pd.read_csv(filepath)
#         active_df = df[df['status'] == 'active']
#         counts = active_df.groupby('category').size().to_dict()
#         return counts
#     except FileNotFoundError:
#         raise ValueError(f"File not found: {filepath}")
#     except KeyError as e:
#         raise ValueError(f"Missing column: {e}")
#
# ⏱️ Generated in 8.3s
# 📊 142 words, 487 tokens
REST API: OpenAI Compatibility (drop-in replacement)
# Ollama exposes OpenAI-compatible API at /v1/chat/completions
# You can use the OpenAI SDK directly!
from openai import OpenAI

# Point to Ollama instead of OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required by the SDK but not checked by Ollama
)

# IDENTICAL code to the OpenAI API
response = client.chat.completions.create(
    model='llama3.3:70b',
    messages=[
        {
            'role': 'system',
            'content': 'You are a web security expert.'
        },
        {
            'role': 'user',
            'content': 'Explain XSS and give an exploit example + mitigation.'
        }
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
# Migration from OpenAI to Ollama = change 2 lines (base_url + model)
# All other code remains identical!
# Example: direct REST call with curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "temperature": 0.1
  }'
# JSON response (OpenAI format)
{
  "id": "chatcmpl-xyz",
  "object": "chat.completion",
  "created": 1735689600,
  "model": "llama3.1:8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 7,
    "total_tokens": 25
  }
}
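Because Ollama reports the same `usage` accounting as OpenAI, you can log what each request would have cost on a metered API — useful for validating the savings claims. A small sketch; the per-million-token prices in `PRICES` are illustrative placeholders, not current rate cards:

```python
# Track avoided API spend from the OpenAI-format "usage" field returned by
# Ollama. The prices below are illustrative placeholders, not real quotes.

PRICES = {  # $ per 1M tokens: (input, output)
    "gpt-4-turbo": (10.0, 30.0),
    "claude-sonnet": (3.0, 15.0),
}

def avoided_cost(usage: dict, model: str) -> float:
    """Dollars this request would have cost on a metered API."""
    in_price, out_price = PRICES[model]
    return (usage["prompt_tokens"] * in_price
            + usage["completion_tokens"] * out_price) / 1_000_000

# The usage block from the JSON response above
usage = {"prompt_tokens": 18, "completion_tokens": 7, "total_tokens": 25}
print(f"${avoided_cost(usage, 'gpt-4-turbo'):.6f}")  # $0.000390
```

Summed over millions of requests per month, this is the number to compare against the fixed server bill.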
LangChain Integration (RAG, Agents, Tool Use)
# Installation (chromadb is needed for the vector store)
pip install langchain langchain-community chromadb

# RAG with Ollama: Q&A system on documentation
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader

# 1. Load documentation
loader = TextLoader("docs/kubernetes-guide.txt")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
texts = text_splitter.split_documents(documents)

# 3. Create embeddings (nomic-embed-text is optimized for RAG; pull it first)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# 4. Index in ChromaDB
vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 5. Create RAG chain
llm = Ollama(model="llama3.3:70b", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# 6. Ask questions
result = qa_chain({"query": "How to configure auto-scaling in Kubernetes?"})
print(result['result'])
print(f"\nSources: {len(result['source_documents'])} documents used")
# Output:
# To configure auto-scaling in Kubernetes, use a
# HorizontalPodAutoscaler (HPA). Define target metrics
# (CPU, memory or custom metrics) and min/max replica limits.
# Example: kubectl autoscale deployment nginx --cpu-percent=50 --min=2 --max=10
#
# Sources: 3 documents used
Production Deployment: Docker Compose, GPU, Load Balancing
Recommended Production Architecture
# docker-compose.production.yml
version: '3.8'

services:
  # NGINX: load balancer to distribute across multiple Ollama workers
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - ollama-worker-1
      - ollama-worker-2
    restart: unless-stopped

  # Ollama Worker 1 (GPU 0)
  ollama-worker-1:
    image: ollama/ollama:latest
    container_name: ollama-worker-1
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - CUDA_VISIBLE_DEVICES=0  # GPU 0
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  # Ollama Worker 2 (GPU 1)
  ollama-worker-2:
    image: ollama/ollama:latest
    container_name: ollama-worker-2
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - CUDA_VISIBLE_DEVICES=1  # GPU 1
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  # Open WebUI: user interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://nginx:80
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=${JWT_SECRET}  # Signs session tokens
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - nginx
    restart: unless-stopped

  # Prometheus: metrics monitoring
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped

  # Grafana: dashboards
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  ollama_models:
  open_webui_data:
  prometheus_data:
  grafana_data:
# nginx.conf: least-connections load balancing between workers
upstream ollama_backend {
    least_conn;  # Send each request to the least-loaded worker
    server ollama-worker-1:11434 max_fails=3 fail_timeout=30s;
    server ollama-worker-2:11434 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://ollama_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # High timeouts for LLMs (generation can take 30s+)
        proxy_connect_timeout 60s;
        proxy_send_timeout 120s;
        proxy_read_timeout 120s;

        # Streaming support
        proxy_buffering off;
        proxy_cache off;
    }

    # Healthcheck endpoint
    location /health {
        access_log off;
        return 200 "OK\n";
        add_header Content-Type text/plain;
    }
}
Real Case: Startup Reducing API Costs by 80%
Context: TechDoc SaaS, technical documentation generation platform for developers. 2000 active users, 2.5M tokens/day input + output. Used GPT-4 Turbo for 12 months.
Problems encountered:
- OpenAI API cost: $4200/month (75M tokens × $56/1M tokens blended average)
- Network latency: 2-5s RTT to OpenAI API (servers in EU)
- Rate limits: blocks at 500 req/min during traffic peaks
- GDPR concerns: user data (proprietary code) sent to OpenAI US
Deployed solution:
- Migration to Ollama + Llama 3.3 70B (Q8 quantization)
- Infra: Hetzner AX102 server (2× RTX 4090, 128GB RAM, $89/month) + Load balancer ($20/month)
- Migration time: 3 days (1 day infra config, 2 days quality tests)
- Code changes: 8 lines modified (change base_url OpenAI SDK)
Results after 6 months:
| Metric | Before (GPT-4 API) | After (Ollama) | Change |
|---|---|---|---|
| Monthly cost | $4200 | $109 (server + backup) | -97% ✅ |
| Latency p50 | 3.2s | 1.8s | -44% ✅ |
| Latency p99 | 12s (rate limits) | 4.1s | -66% ✅ |
| Output quality (human eval) | 92% | 89% | -3% ⚠️ |
| Availability | 99.7% (OpenAI SLA) | 99.95% (self-hosted) | +0.25% ✅ |
| Rate limit incidents | 12-15/month | 0 | -100% ✅ |
CTO feedback:
"The migration to Ollama was surprisingly simple. We saved $25,000 over 6 months while improving latency and eliminating rate limits. The slight quality drop (89% vs 92%) is imperceptible to our users — we measured via A/B test and identical NPS. For 80% of our use cases, Llama 3.3 is indistinguishable from GPT-4. We keep GPT-4 API only for 2-3% of ultra-complex requests (via automatic fallback). ROI: migration investment recovered in 2 weeks."
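The "automatic fallback" mentioned above can start as a few lines of routing logic: default to Ollama, escalate the rare hard request to a hosted API. A hedged sketch — the keyword heuristic and the 6000-character threshold are invented for illustration, not TechDoc's actual rule:

```python
# Route requests: local Ollama by default, hosted API for the hardest slice.
# The heuristic below (prompt length + keyword hints) is a made-up placeholder;
# real systems might use a classifier, a confidence score, or user tier.

HARD_HINTS = ("prove", "legal opinion", "multi-step plan")

def pick_backend(prompt: str, max_local_chars: int = 6000) -> str:
    """Return 'ollama' or 'openai-fallback' for a given prompt."""
    if len(prompt) > max_local_chars:
        return "openai-fallback"  # very long context: escalate
    if any(hint in prompt.lower() for hint in HARD_HINTS):
        return "openai-fallback"  # flagged as complex: escalate
    return "ollama"

print(pick_backend("Summarize this changelog in 3 bullets"))       # ollama
print(pick_backend("Prove this scheduling algorithm is optimal"))  # openai-fallback
```

Because both backends speak the OpenAI wire format, the router only has to swap `base_url` and `model` — the request body is identical.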
Production Best Practices
Monitoring and Alerts
# prometheus.yml: scraping Ollama metrics
global:
  scrape_interval: 15s

scrape_configs:
  # GPU metrics via NVIDIA DCGM
  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['dcgm-exporter:9400']

  # System metrics (node_exporter)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Custom Ollama metrics (via wrapper)
  - job_name: 'ollama'
    static_configs:
      - targets: ['ollama-exporter:8000']
# Critical alerts (alertmanager)
# alerts.yml
groups:
  - name: ollama
    rules:
      # GPU temperature > 85°C
      - alert: GPUOverheating
        expr: nvidia_gpu_temperature_celsius > 85
        for: 5m
        annotations:
          summary: "GPU overheating on {{ $labels.instance }}"

      # VRAM utilization > 95%
      - alert: VRAMSaturation
        expr: nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes > 0.95
        for: 2m
        annotations:
          summary: "VRAM near saturation on GPU {{ $labels.gpu }}"

      # Latency p95 > 10s (histogram_quantile needs a rate over the buckets)
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 10
        for: 5m
        annotations:
          summary: "High latency detected (p95 > 10s)"

      # Error rate > 5%
      - alert: HighErrorRate
        expr: rate(ollama_requests_failed_total[5m]) / rate(ollama_requests_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "Error rate above 5%"
Troubleshooting: Common Issues and Solutions
| Symptom | Probable Cause | Solution |
|---|---|---|
| Very slow generation (>30s) | Model too large for available RAM/VRAM, swap used | Use quantized version (Q4) or smaller model (8B instead of 70B) |
| "out of memory" error | Insufficient GPU VRAM | Switch to Q4 quantization or upgrade GPU (minimum 24GB for 70B) |
| GPU not detected (CPU fallback) | Missing NVIDIA drivers or nvidia-docker not installed | Install CUDA toolkit + nvidia-docker, check nvidia-smi |
| Lower quality than expected | Temperature too high (excessive creativity) or unsuitable model | Lower temperature (0.1-0.3 for factual tasks), try another model |
| Latency increases after 1h use | GPU thermal throttling (>85°C) | Improve cooling, reduce load (fewer concurrent workers) |
| First request takes 30-60s | Cold start: loading model into VRAM | Increase OLLAMA_KEEP_ALIVE (keep model loaded), or preload at startup |
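For the cold-start row specifically, the fix on a systemd install is a drop-in override. The path and pattern below are the standard systemd mechanism; adjust the duration to your traffic:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# Keeps models resident in VRAM to avoid the 30-60s cold start.
# "24h" unloads a model 24h after its last use; "-1" never unloads.
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
```

After `sudo systemctl daemon-reload && sudo systemctl restart ollama`, the first request still pays the load cost once; you can also preload at boot by sending `/api/generate` a request containing just the model name, as documented in the Ollama FAQ.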
Resources and Training
To master deploying open-source LLMs in production and integrating Ollama into your applications, our Claude API for Developers training also covers open-source alternatives (Llama, Mistral), hybrid architectures (API + self-hosted), and migration strategies. 2-day training, OPCO eligible.
We also offer a specialized "Self-Hosted LLMs in Production" module (1 day) on Ollama, vLLM, and GPU optimizations. Contact us via the contact form.
Frequently Asked Questions
Is Ollama really free for commercial use?
Yes. Ollama itself is open-source (MIT license) and can be used commercially without restriction. Most supported models allow commercial use too — Mistral's weights under Apache 2.0 outright, Llama under the Llama Community License (free for commercial use unless you exceed a very large monthly-active-user threshold). Only constraint: you pay for infrastructure (GPU/CPU server). Typical cost: $50-200/month depending on volume vs $500-5000/month for equivalent proprietary APIs.
What's the difference between Ollama and an API like OpenAI/Claude?
Ollama runs models locally (on your machine or server), proprietary APIs are hosted by the provider. Ollama advantages: zero cost per token, 100% private data, no rate limits, works offline. Disadvantages: requires GPU infrastructure for optimal performance, quality inferior to best proprietary models (GPT-4, Claude Opus) on complex tasks.
Which models to choose for production in 2026?
For general use: Llama 3.3 70B (best quality/performance ratio, similar to GPT-4 Turbo). For critical latency: Llama 3.1 8B or Mistral 7B (sub-second first tokens on GPU, CPU viable at low volume). For code: CodeLlama 34B or DeepSeek Coder 33B. For multilingual: Mistral Large 2 or Qwen2.5. Rule of thumb: use the smallest model that meets your quality criteria.
Can you deploy Ollama without GPU?
Yes, Ollama works on CPU but it's 10-50x slower. For CPU-only production: limit to 7B-13B quantized models (Q4_K_M) and accept 5-15s latency per response. For serious production: GPU required. Minimum viable: RTX 4090 24GB ($500 used) or NVIDIA L4 cloud ($0.50/h). For scale: A100 40GB ($2-3/h) or H100 ($4-6/h).
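The RAM/VRAM figures quoted above follow a simple rule of thumb: weight footprint ≈ parameter count × bits-per-weight / 8, plus runtime overhead for the KV cache and buffers. A rough sketch — the 1.2× overhead factor and the ~4.5 effective bits for Q4_K_M are approximations, not measured values:

```python
def approx_vram_gb(params_billion: float, bits_per_weight: float,
                   overhead: float = 1.2) -> float:
    """Rough memory estimate in GB: weights * overhead for KV cache/runtime."""
    weight_gb = params_billion * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

# FP16 vs Q4 for a 70B model, and Q4 for a 7B model
print(approx_vram_gb(70, 16))   # ~168 GB  -> multi-GPU territory
print(approx_vram_gb(70, 4.5))  # ~47 GB   -> fits 2x 24GB cards
print(approx_vram_gb(7, 4.5))   # ~4.7 GB  -> CPU or small-GPU viable
```

This is why the guidance above pairs 7B quantized models with CPU-only VPSes and reserves 70B models for 24GB-plus GPU setups.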
How to migrate from OpenAI API to Ollama without rewriting code?
Ollama exposes an OpenAI-compatible API. Change only the base URL (http://localhost:11434/v1) and the model name (llama3.3:70b); your `client.chat.completions.create()` calls work as-is. Main caveat: function/tool calling support is model-dependent and less mature than OpenAI's (prompt engineering or LangChain can bridge the gap). Migration = 5 lines of code modified.