Running Local LLMs in Production: Ollama & vLLM Troubleshooting Guide
A hands-on operations guide for DevOps leads, ML engineers, and infrastructure architects running local inference at scale. You will build VRAM allocation strategies for multi-GPU setups, benchmark FP8 vs. AWQ vs. GGUF quantization formats, eliminate cold-load latency spikes, and configure GPU failover with automatic recovery. Along the way you will work through 10 production incident scenarios drawn from real Ollama and vLLM deployments, including OOM on unified memory, orphan llama-server cleanup, and Qwen3 think-block content issues.
Duration
2 days
Level
Advanced
Price
9.99 EUR/month (all courses included)
Max group
8 participants
What you will learn
+ Calculate exact VRAM requirements for any model and quantization format
+ Choose between FP8, AWQ, and GGUF Q-series based on hardware and quality targets
+ Eliminate cold-load latency with keep-alive configuration and warmup services
+ Build multi-GPU tier allocation strategies for mixed workloads
+ Implement automatic GPU failover with health monitoring and Redis routing
+ Diagnose and fix 10 common Ollama and vLLM production failures
+ Compute ROI break-even between self-hosted and cloud inference
Course program
Module 1: VRAM Management and GPU Architecture (3h00)
- VRAM calculation formula: weights + KV cache + overhead
- GPU tier allocation strategy for mixed workloads
- Live VRAM profiling with nvidia-smi and rocm-smi
- Workshop: calculate and optimize VRAM for your model
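The module's formula (weights + KV cache + overhead) can be sketched as a back-of-the-envelope estimator. The GQA dimensions, bits-per-weight figure, and the 1.5 GiB overhead constant below are illustrative assumptions, not values tied to any specific model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, batch=1, dtype_bytes=2):
    """KV cache size: K and V tensors (hence the factor 2) per layer, FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * batch * dtype_bytes

def vram_estimate_gib(params_billions, bits_per_weight,
                      n_layers, n_kv_heads, head_dim, ctx_len,
                      overhead_gib=1.5):
    """weights + KV cache + fixed runtime overhead, in GiB (overhead is a rough guess)."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    kv = kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len)
    return (weight_bytes + kv) / 2**30 + overhead_gib

# Illustrative 8B GQA model at ~4.5 bits/weight (Q4_K_M-class), 8k context:
print(round(vram_estimate_gib(8, 4.5, 32, 8, 128, 8192), 2))  # ≈ 6.69 GiB
```

The estimator deliberately ignores activation buffers and CUDA graph memory, which is what the overhead constant papers over; real profiling with nvidia-smi remains the ground truth.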
Module 2: Quantization Trade-offs: FP8, AWQ, GGUF (2h30)
- Format decision matrix: when to use each quantization
- Benchmarking quality loss vs. VRAM savings
- Downloading and importing quantized models into Ollama
- Workshop: benchmark Q4_K_M vs Q8_0 on your hardware
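One way to encode a decision matrix like this module's is a rule-of-thumb helper. The 0.6 headroom ratio and the backend split below are illustrative assumptions, not the course's actual matrix:

```python
def recommend_quant(backend, free_vram_gb, fp16_size_gb, fp8_hw=False):
    """Rule-of-thumb quantization pick (illustrative thresholds).

    backend: "vllm" (GPU serving) or "ollama" (llama.cpp / GGUF).
    fp8_hw: True on GPUs with native FP8 support (e.g. Hopper/Ada).
    """
    roomy = free_vram_gb >= fp16_size_gb * 0.6  # room for an 8-bit-class model
    if backend == "vllm":
        if fp8_hw and roomy:
            return "FP8"          # near-lossless at roughly half the FP16 footprint
        return "AWQ (4-bit)"      # tight VRAM or no FP8 hardware
    return "GGUF Q8_0" if roomy else "GGUF Q4_K_M"

print(recommend_quant("ollama", free_vram_gb=10, fp16_size_gb=16))  # GGUF Q8_0
```

A helper like this only narrows the candidates; the workshop's quality benchmark on your own hardware is what actually decides between Q4_K_M and Q8_0.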
Module 3: Latency Tuning and Cold-Load Optimization (3h00)
- Solving cold-load with keep-alive and warmup services
- vLLM cold-load fix with llama-swap multi-backend routing
- num_ctx impact on throughput: finding your optimal window
- Workshop: measure and reduce TTFT to under 500ms
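A minimal warmup sketch against Ollama's HTTP API: an empty-prompt call to /api/generate loads the model, and the keep_alive parameter set to -1 keeps it resident so the first user request never pays the cold-load cost. The base URL and model name in the comment are placeholders:

```python
import json
import urllib.request

def warmup_payload(model, keep_alive=-1):
    # keep_alive=-1 asks Ollama to keep the model in VRAM indefinitely;
    # an empty prompt loads the model without generating any tokens.
    return {"model": model, "prompt": "", "keep_alive": keep_alive}

def warmup(base_url, model, timeout=120):
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(warmup_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:  # blocks until loaded
        return resp.status

# e.g. fired from a systemd unit or cron job at boot:
# warmup("http://localhost:11434", "qwen3:32b")
```

Running this as a boot-time service turns the cold load into a one-time cost paid before traffic arrives, instead of a latency spike on the first request.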
Module 4: GPU Failover and Recovery Patterns (2h30)
- Health check architecture with Redis routing
- Resilient client with automatic endpoint failover
- GPU hang detection and automated recovery scripts
- Workshop: build a failover chain across GPU tiers
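The resilient-client pattern can be sketched as an ordered endpoint list plus a health probe; Redis-backed routing and retry budgets are left out for brevity. The /api/tags probe path is Ollama's model-list endpoint; the hostnames and tier layout are illustrative assumptions:

```python
import urllib.request

def probe_ollama(url, timeout=2):
    """Health probe: /api/tags answers quickly when the server is up."""
    try:
        with urllib.request.urlopen(url.rstrip("/") + "/api/tags", timeout=timeout) as r:
            return r.status == 200
    except OSError:
        return False

def first_healthy(endpoints, probe=probe_ollama):
    """Walk the tier list in priority order and return the first live endpoint."""
    for url in endpoints:
        if probe(url):
            return url
    raise RuntimeError("no healthy inference endpoint")

# Tiers: big GPU box first, smaller GPU second, CPU fallback last:
# first_healthy(["http://gpu-a:11434", "http://gpu-b:11434", "http://cpu-0:11434"])
```

Injecting the probe as a parameter keeps the failover logic testable without a network, and is also where a Redis-cached health table would plug in.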
Module 5: Cost vs. Latency Decision Framework (2h00)
- Workload classification: interactive, batch, privacy-constrained
- ROI calculator: break-even volume for self-hosted vs. cloud
- Hardware investment analysis with 36-month amortization
- Workshop: compute ROI for your actual workload
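The break-even calculation fits in a few lines: amortized hardware plus power gives a fixed monthly cost, and dividing by the cloud's per-token price yields the volume at which self-hosting wins. The hardware price, power draw, electricity rate, and cloud price below are made-up inputs for illustration:

```python
def breakeven_mtok_per_month(hw_cost_eur, amort_months, power_kw,
                             eur_per_kwh, cloud_eur_per_mtok, hours_per_month=720):
    """Monthly token volume (in millions) above which self-hosting beats cloud."""
    monthly_fixed = hw_cost_eur / amort_months + power_kw * hours_per_month * eur_per_kwh
    return monthly_fixed / cloud_eur_per_mtok

# Illustrative: 8000 EUR rig amortized over 36 months, 0.35 kW average draw,
# 0.30 EUR/kWh, cloud priced at 0.50 EUR per million tokens:
print(round(breakeven_mtok_per_month(8000, 36, 0.35, 0.30, 0.50)))  # ≈ 596 Mtok/month
```

Below that volume the cloud is cheaper; above it, the rig pays for itself within the amortization window. Ops salaries and GPU depreciation risk are deliberately out of scope here.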
Module 6: Production Troubleshooting Playbook (3h00)
- Scenarios 1–5: OOM, cold-load, orphan processes, empty content, MoE migration
- Scenarios 6–10: gibberish output, ROCm errors, cache thrashing, 503 errors, determinism
- Diagnostics cheatsheet: one-liners for production incidents
- Workshop: diagnose a simulated failing deployment end-to-end
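One of the playbook's incidents, orphaned llama-server workers left behind after an Ollama crash, can be detected by scanning ps output for llama-server processes reparented to PID 1. This parser is a sketch, and a real deployment should verify matches before killing anything:

```python
import subprocess

def find_orphans(ps_lines, pattern="llama-server"):
    """ps_lines: lines from `ps -eo pid,ppid,comm`; orphans show PPID 1."""
    orphans = []
    for line in ps_lines[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] == "1" and parts[2].startswith(pattern):
            orphans.append(int(parts[0]))
    return orphans

def scan():
    """Run ps and return the PIDs of orphaned llama-server processes."""
    out = subprocess.run(["ps", "-eo", "pid,ppid,comm"],
                         capture_output=True, text=True, check=True).stdout
    return find_orphans(out.splitlines())
```

Keeping the parsing separate from the ps invocation makes the detection logic testable with canned output, which is also how the workshop's simulated failing deployment can be exercised offline.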
Ready to get started?
9.99 EUR/month — All courses included, cancel anytime