LLM Inference & Serving Optimization
An advanced (L300), hands-on course for engineers serving LLMs in production. Serving a model stays expensive and slow until you understand what happens under the hood. This course takes apart the mechanics of inference (why token generation is memory-bound, how the KV-cache grows, where the VRAM actually goes), then equips you to optimize it with quantization (GPTQ, AWQ, GGUF, FP8), continuous batching and PagedAttention with vLLM, speculative decoding, Mixture-of-Experts, and long-context handling (RoPE/YaRN). You leave with reproducible benchmarks and a method for choosing a serving framework.
Duration
2 days
Level
Advanced
Price
9.99 EUR/month (all courses included)
Max group
12 participants
What you will learn
- Explain why LLM token generation (the decode phase) is bound by memory bandwidth rather than raw compute
- Calculate required VRAM (weights + KV-cache) and anticipate the impact of context length
- Choose and apply a quantization method or format (GPTQ, AWQ, GGUF, FP8) based on the quality/speed trade-off
- Stand up high-performance serving with continuous batching and PagedAttention (vLLM)
- Assess the gain from speculative decoding and understand the purpose of Mixture-of-Experts architectures
- Benchmark a deployment rigorously: TTFT, tokens/second, throughput under load
Course program
Module 1: The economics of inference: VRAM, attention, and the KV-cache
3h30
- Why inference is memory-bound: memory bandwidth vs raw compute
- The attention mechanism and the KV-cache: what is stored and why it grows with context
- VRAM math: model weights + KV-cache (formula and reference table)
- Measure what matters: TTFT (time-to-first-token), TPS (tokens/second), throughput
- Workshop: estimate VRAM and latency for a 7B/13B model across context lengths
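To make the VRAM math in this module concrete, here is a minimal sketch of a weights + KV-cache estimate. It is an illustration under stated assumptions, not the course's own worksheet: the `ModelShape` container and function names are hypothetical, the shape values describe a 7B model with 32 layers, 32 KV heads, and head dimension 128 stored in FP16, and the bandwidth figure passed to the decode bound is a placeholder rather than a measured number. Activations, framework overhead, and fragmentation are ignored.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Hypothetical container for the few numbers the estimate needs."""
    n_params: float   # total parameter count
    n_layers: int     # number of transformer layers
    n_kv_heads: int   # key/value heads (smaller than query heads under GQA)
    head_dim: int     # dimension of each attention head


def weight_bytes(shape: ModelShape, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone (2 bytes per parameter = FP16/BF16)."""
    return shape.n_params * bytes_per_param


def kv_cache_bytes(shape: ModelShape, seq_len: int, batch_size: int = 1,
                   bytes_per_elem: float = 2.0) -> float:
    """KV-cache size: one K and one V vector per layer, per KV head, per token."""
    return (2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
            * seq_len * batch_size * bytes_per_elem)


def decode_tps_upper_bound(shape: ModelShape, bandwidth_bytes_per_s: float,
                           bytes_per_param: float = 2.0) -> float:
    """Rough single-sequence ceiling: every decode step reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes(shape, bytes_per_param)


# Assumed 7B-class shape; adjust to the model you actually serve.
shape_7b = ModelShape(n_params=7e9, n_layers=32, n_kv_heads=32, head_dim=128)

for ctx in (2048, 8192, 32768):
    kv = kv_cache_bytes(shape_7b, ctx)
    total = weight_bytes(shape_7b) + kv
    print(f"context {ctx:>6}: KV-cache {kv / GIB:6.2f} GiB, total {total / GIB:6.2f} GiB")

# Placeholder bandwidth of 1e12 bytes/s, chosen only to illustrate the bound.
print(f"decode ceiling: {decode_tps_upper_bound(shape_7b, 1e12):.1f} tokens/s")
```

The KV-cache term is linear in both context length and batch size, which is why long contexts and high concurrency compete for the same memory.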
Module 2: Quantization: shrink the model without breaking quality
3h30
- Formats: INT8, INT4, FP8, NF4 — quantizing weights and/or activations
- Methods: GPTQ, AWQ, GGUF (llama.cpp), bitsandbytes — differences and use cases
- The quality / size / speed trade-off and sensitivity by model architecture
- Workshop: quantize a model and measure quality degradation vs VRAM and latency gains
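The size/quality trade-off can be seen in a few lines with the simplest possible scheme. The sketch below uses plain symmetric absmax round-to-nearest quantization on a short list of weights. This is deliberately not GPTQ or AWQ, which add calibration data and error compensation on top; the function names and the sample weights are made up for illustration.

```python
def quantize_absmax(weights, n_bits=8):
    """Symmetric round-to-nearest quantization with a single shared scale."""
    qmax = 2 ** (n_bits - 1) - 1              # 127 for INT8, 7 for INT4
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate float weights."""
    return [v * scale for v in q]


def max_abs_error(original, restored):
    """Worst-case reconstruction error across all weights."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up weights, standing in for one small group of a weight matrix.
weights = [0.5, -1.0, 0.25, 0.0, 0.33, -0.72]

for bits in (8, 4):
    q, scale = quantize_absmax(weights, n_bits=bits)
    err = max_abs_error(weights, dequantize(q, scale))
    print(f"INT{bits}: scale={scale:.5f}, max abs error={err:.5f}")
```

Halving the bit width halves the storage but makes each quantization step much coarser, so the reconstruction error grows. Real methods reduce that error by using per-group scales and by choosing roundings that minimize the effect on layer outputs.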
Module 3: High-performance serving: continuous batching and PagedAttention
3h30
- Continuous batching vs static batching: why throughput multiplies under real load
- PagedAttention and KV-cache memory management in vLLM
- Choosing a framework: vLLM, TGI, SGLang, llama.cpp — strengths and trade-offs
- Workshop: deploy with vLLM and benchmark throughput under concurrent load
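The difference between static and continuous batching can be illustrated with a toy scheduler simulation. This sketch does not use vLLM and does not reflect its internals: it only counts decode steps, assumes all requests are already queued, ignores prefill cost and memory limits, and uses hypothetical function names and output lengths.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes; short ones idle."""
    total = 0
    for i in range(0, len(output_lens), batch_size):
        total += max(output_lens[i:i + batch_size])
    return total


def continuous_batching_steps(output_lens, batch_size):
    """A finished request frees its slot, which is refilled before the next step."""
    waiting = deque(output_lens)
    running = []          # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1        # one decode step yields one token per active request
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Made-up workload: two long generations mixed with short ones.
output_lens = [10, 2, 2, 2, 10, 2, 2, 2]
total_tokens = sum(output_lens)

for name, fn in (("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)):
    steps = fn(output_lens, batch_size=4)
    print(f"{name:>10}: {steps} steps, {total_tokens / steps:.2f} tokens/step")
```

With mixed output lengths, static batching keeps slots empty while the longest request finishes, whereas the continuous scheduler keeps them busy. The gap widens as the variance in output length grows.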
Module 4: Advanced techniques: speculative decoding, MoE, and long context
3h30
- Speculative decoding: a small 'draft' model proposes several tokens that the larger target model verifies in a single forward pass
- Mixture-of-Experts (MoE): sparse activation and why it scales
- FlashAttention: reducing the memory footprint of attention
- Long context: RoPE, YaRN, and the real cost of an extended context window
- Workshop: enable speculative decoding and measure the end-to-end latency gain
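The draft-then-verify loop behind speculative decoding can be sketched with toy "models" that are ordinary functions from a token list to the next token. This is the greedy variant only, under simplifying assumptions: production systems verify all draft positions in one batched target forward pass (simulated here with per-position calls) and use a rejection-sampling rule when sampling. All names and both toy models are hypothetical.

```python
def greedy_decode(target_next, prompt, n_new):
    """Baseline: one target pass per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. One target pass checks every proposed position.
        target_passes += 1
        accepted = []
        for i, tok in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            accepted.append(expected)     # always keep the target's token
            if tok != expected:
                break                     # first mismatch ends this round
        else:
            # All k matched, so the same pass also yields one bonus token.
            accepted.append(target_next(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], target_passes


def toy_target(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    """Toy draft model: agrees with the target except after token 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


baseline = greedy_decode(toy_target, [0], n_new=10)
spec_tokens, spec_passes = speculative_decode(toy_target, toy_draft, [0], n_new=10, k=4)
print("same output:", spec_tokens == baseline)
print("target passes:", spec_passes, "vs", 10, "for plain decoding")
```

Because every kept token comes from the target model, the output matches plain greedy decoding. The gain depends entirely on how often the draft agrees with the target, which is what the workshop measurement should capture.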
9.99 EUR/month — All courses included, cancel anytime