
Running Local LLMs in Production: Ollama & vLLM Troubleshooting Guide

A hands-on operations guide for DevOps leads, ML engineers, and infrastructure architects running local inference at scale. You will build VRAM allocation strategies for multi-GPU setups, benchmark FP8 vs. AWQ vs. GGUF quantization formats, eliminate cold-load latency spikes, configure GPU failover and automatic recovery, and work through 10 production incident scenarios drawn from real Ollama and vLLM deployments, including OOM on unified memory, orphan llama-server cleanup, and empty responses caused by Qwen3 think blocks.

Duration
2 days
Level
Advanced
Price
9.99 EUR/month (all courses included)
Max group
8 participants

What you will learn

+ Calculate exact VRAM requirements for any model and quantization format
+ Choose between FP8, AWQ, and GGUF Q-series based on hardware and quality targets
+ Eliminate cold-load latency with keep-alive configuration and warmup services
+ Build multi-GPU tier allocation strategies for mixed workloads
+ Implement automatic GPU failover with health monitoring and Redis routing
+ Diagnose and fix 10 common Ollama and vLLM production failures
+ Compute ROI break-even between self-hosted and cloud inference

Course program

Module 1: VRAM Management and GPU Architecture

3h00
  • VRAM calculation formula: weights + KV cache + overhead (sketched in code below)
  • GPU tier allocation strategy for mixed workloads
  • Live VRAM profiling with nvidia-smi and rocm-smi
  • Workshop: calculate and optimize VRAM for your model
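
A preview of the Module 1 formula as code: a minimal sketch assuming a dense transformer with grouped-query attention. The example shapes match Llama-3.1-8B; the 10% overhead factor (CUDA context, activations, fragmentation) is a rule of thumb, not a fixed constant.

```python
def estimate_vram_gib(
    params_b: float,          # model size in billions of parameters
    bytes_per_weight: float,  # 2.0 for FP16, ~1.0 for FP8/Q8_0, ~0.6 for Q4_K_M
    num_layers: int,
    num_kv_heads: int,        # KV heads (GQA), not attention heads
    head_dim: int,
    context_len: int,
    kv_bytes: float = 2.0,    # FP16 KV cache; ~1.0 if the server quantizes KV to 8-bit
    overhead: float = 0.10,   # assumed: CUDA context, activations, fragmentation
) -> float:
    """Rough VRAM estimate: weights + KV cache + overhead."""
    weights = params_b * 1e9 * bytes_per_weight
    # K and V tensors: one entry per layer, per KV head, per token in context
    kv_cache = 2 * num_layers * num_kv_heads * head_dim * kv_bytes * context_len
    return (weights + kv_cache) * (1 + overhead) / 1024**3

# Llama-3.1-8B (32 layers, 8 KV heads, head_dim 128) at FP16 with an 8k window
print(f"{estimate_vram_gib(8.0, 2.0, 32, 8, 128, 8192):.1f} GiB")  # ~17.5 GiB
```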

Module 2: Quantization Trade-offs (FP8, AWQ, GGUF)

2h30
  • Format decision matrix: when to use each quantization (size estimator below)
  • Benchmarking quality loss vs. VRAM savings
  • Downloading and importing quantized models into Ollama
  • Workshop: benchmark Q4_K_M vs Q8_0 on your hardware
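
A rough size estimator to pair with the decision matrix. The bits-per-weight values are rules of thumb rather than exact specs: GGUF K-quants mix block sizes, and AWQ stores extra scales and zero points.

```python
# Approximate bits per weight by format (assumed rule-of-thumb values)
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "FP8": 8.0,
    "Q8_0": 8.5,       # GGUF: 8-bit values plus per-block scales
    "AWQ-4bit": 4.25,  # 4-bit weights plus scales/zeros
    "Q4_K_M": 4.85,    # GGUF K-quant mixing 4- and 6-bit blocks
}

def weights_gib(params_b: float, fmt: str) -> float:
    """Size of the weights alone, in GiB; KV cache and overhead come on top."""
    return params_b * 1e9 * BITS_PER_WEIGHT[fmt] / 8 / 1024**3

for fmt in BITS_PER_WEIGHT:
    print(f"70B @ {fmt:>8}: {weights_gib(70, fmt):6.1f} GiB")
```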

Module 3: Latency Tuning and Cold-Load Optimization

3h00
  • Solving cold-load with keep-alive and warmup services
  • vLLM cold-load fix with llama-swap multi-backend routing
  • num_ctx impact on throughput: finding your optimal window
  • Workshop: measure and reduce TTFT to under 500 ms (starter probe below)
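
A starter probe for the TTFT workshop, a minimal sketch against Ollama's streaming /api/generate endpoint. keep_alive is a documented Ollama request parameter that pins the model in VRAM after the call; the model tag and the 30-minute value here are illustrative.

```python
import json, time, urllib.request

OLLAMA = "http://localhost:11434"  # default Ollama address

def ttft_seconds(model: str, prompt: str, keep_alive: str = "30m") -> float:
    """Time to first token via the streaming /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": True, "keep_alive": keep_alive}).encode()
    req = urllib.request.Request(OLLAMA + "/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        for line in resp:                         # one JSON object per line
            if json.loads(line).get("response"):  # first non-empty token
                return time.monotonic() - start
    return float("inf")

# The first call pays the cold load; the second should land well under 500 ms.
for run in ("cold", "warm"):
    print(run, f"{ttft_seconds('llama3.1:8b', 'Hi'):.3f} s")
```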

Module 4: GPU Failover and Recovery Patterns

2h30
  • Health check architecture with Redis routing
  • Resilient client with automatic endpoint failover (sketched below)
  • GPU hang detection and automated recovery scripts
  • Workshop: build a failover chain across GPU tiers
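
A minimal failover sketch under stated assumptions: redis-py installed, two hypothetical vLLM hosts (gpu-a on the fast tier, gpu-b as fallback), and an arbitrary Redis key name. vLLM's OpenAI-compatible server exposes GET /v1/models, which doubles as a cheap liveness probe.

```python
import time
import urllib.request
import redis  # pip install redis

# Failover order: fast tier first, fallback tier after (hypothetical hosts)
ENDPOINTS = ["http://gpu-a:8000/v1", "http://gpu-b:8000/v1"]
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def healthy(base: str, timeout: float = 2.0) -> bool:
    """Probe the OpenAI-compatible model list that vLLM serves."""
    try:
        with urllib.request.urlopen(base + "/models", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def health_loop(interval: float = 5.0) -> None:
    """Keep the first healthy endpoint published in Redis. The TTL makes a
    dead health-checker fail safe instead of pinning a stale route."""
    while True:
        for base in ENDPOINTS:
            if healthy(base):
                r.set("llm:active_endpoint", base, ex=int(interval * 3))
                break
        time.sleep(interval)

def active_endpoint() -> str:
    """Client side: read the routed endpoint, else fall back to the static list."""
    return r.get("llm:active_endpoint") or ENDPOINTS[0]
```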

Module 5: Cost vs. Latency Decision Framework

2h00
  • Workload classification: interactive, batch, privacy-constrained
  • ROI calculator: break-even volume for self-hosted vs. cloud (sketch below)
  • Hardware investment analysis with 36-month amortization
  • Workshop: compute ROI for your actual workload
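
The break-even arithmetic behind the ROI calculator, as a sketch; every number in the example is illustrative, not a price quote.

```python
def breakeven_mtok_per_month(
    hardware_eur: float,        # upfront GPU/server cost
    power_eur_month: float,     # electricity, hosting, maintenance
    cloud_eur_per_mtok: float,  # blended cloud price per million tokens
    months: int = 36,           # amortization window used in Module 5
) -> float:
    """Monthly volume (millions of tokens) above which self-hosting wins."""
    self_hosted_monthly = hardware_eur / months + power_eur_month
    return self_hosted_monthly / cloud_eur_per_mtok

# Illustrative: 9,000 EUR server, 120 EUR/month to run, 0.50 EUR per MTok cloud
print(f"break-even: {breakeven_mtok_per_month(9000, 120, 0.50):.0f} MTok/month")
```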

Module 6: Production Troubleshooting Playbook

3h00
  • Scenario 1–5: OOM, cold-load, orphan processes, empty content, MoE migration
  • Scenario 6–10: gibberish output, ROCm errors, cache thrashing, 503 errors, determinism
  • Diagnostics cheatsheet: one-liners for production incidents (sample sweep below)
  • Workshop: diagnose a simulated failing deployment end-to-end
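
A sample of the cheatsheet wrapped in Python so it runs as one sweep. The nvidia-smi query flags and Ollama's /api/ps endpoint are standard; the llama-server process name matches the orphan-process scenario above but may differ between Ollama versions.

```python
import subprocess

def run(cmd: list[str]) -> str:
    """Run one diagnostic command without letting a failure stop the sweep."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return out.stdout.strip() or out.stderr.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>"

# VRAM usage per GPU
print(run(["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
           "--format=csv,noheader"]))
# Orphan runner processes left behind after a crash
print(run(["pgrep", "-af", "llama-server"]))
# Is the Ollama API up, and which models are currently loaded?
print(run(["curl", "-s", "http://localhost:11434/api/ps"]))
```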

Ready to get started?

9.99 EUR/month — All courses included, cancel anytime

Request a quote
View all courses