LLM Inference & Serving Optimization

Name: LLM Inference & Serving Optimization — 2026

An L300 advanced, hands-on course for engineers serving LLMs in production. Serving a model is expensive and slow until you understand what happens under the hood. This course takes apart the mechanics of inference — why it is memory-bound, how the KV-cache grows, where the VRAM actually goes — then equips you to optimize it: quantization (GPTQ, AWQ, GGUF, FP8), continuous batching and PagedAttention with vLLM, speculative decoding, Mixture-of-Experts, and long-context handling (RoPE/YaRN). You leave with reproducible benchmarks and a method for choosing a serving framework.

Duration

2 days

Level

Advanced

Price

9.99 EUR/month (all courses included)

Max group

12 participants

What you will learn

+ Explain why LLM inference, and token-by-token decoding in particular, is bound by memory bandwidth rather than raw compute

+ Calculate required VRAM (weights + KV-cache) and anticipate the impact of context length

+ Choose and apply a quantization method (GPTQ, AWQ, GGUF, FP8) along the quality/speed trade-off

+ Stand up high-performance serving with continuous batching and PagedAttention (vLLM)

+ Assess the gain from speculative decoding and understand the rationale for Mixture-of-Experts architectures

+ Benchmark a deployment rigorously: time-to-first-token (TTFT), tokens/second, and throughput under load
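
The three benchmark metrics named in the last objective reduce to simple arithmetic over token timestamps. The sketch below is an illustration only: `RequestTrace` and the timestamp layout are hypothetical names invented for this example, not part of any serving framework.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps (seconds) recorded by a hypothetical load-test client."""
    sent_at: float            # when the request was sent
    token_times: list[float]  # arrival time of each generated token


def ttft(trace: RequestTrace) -> float:
    """Time-to-first-token: delay between sending and the first token."""
    return trace.token_times[0] - trace.sent_at


def decode_tps(trace: RequestTrace) -> float:
    """Per-request tokens/second after the first token (needs >= 2 tokens)."""
    n_intervals = len(trace.token_times) - 1
    elapsed = trace.token_times[-1] - trace.token_times[0]
    return n_intervals / elapsed


def throughput(traces: list[RequestTrace]) -> float:
    """Aggregate tokens/second across all requests in the test window."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


# Two concurrent requests with made-up timestamps.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.5, 0.6, 0.7, 0.8, 0.9]),
    RequestTrace(sent_at=0.0, token_times=[0.7, 0.8, 0.9, 1.0, 1.1]),
]
for t in traces:
    print(f"TTFT={ttft(t):.2f}s  decode TPS={decode_tps(t):.1f}")
print(f"aggregate throughput={throughput(traces):.2f} tokens/s")
```

Keeping per-request TPS separate from aggregate throughput matters under load: a server can raise total throughput while each individual user sees slower streaming.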

Course program

Module 1: The economics of inference: VRAM, attention, and the KV-cache

3h30

Why inference is memory-bound: memory bandwidth vs raw compute
The attention mechanism and the KV-cache: what is stored and why it grows with context
VRAM math: model weights + KV-cache (formula and reference table)
Measure what matters: TTFT (time-to-first-token), TPS (tokens/second), throughput
Workshop: estimate VRAM and latency for a 7B/13B model across context lengths
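
As a rough illustration of the VRAM math in this module, the sketch below adds weight memory to KV-cache memory and derives a bandwidth-bound lower limit on per-token decode latency. The model shape (32 layers, 32 KV heads, head dimension 128, 7 billion parameters, FP16) and the 1 TB/s bandwidth figure are assumptions chosen for the example, and the estimate ignores activations, framework overhead, and fragmentation.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights at a given precision."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """KV-cache size: one key and one value vector per layer, head, token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def decode_latency_floor_s(bytes_read_per_token: float,
                           bandwidth_bytes_per_s: float) -> float:
    """Lower bound per decoded token if every byte is read once from VRAM."""
    return bytes_read_per_token / bandwidth_bytes_per_s


# Assumed 7B-class dense model in FP16 (illustrative numbers).
weights = weight_bytes(n_params=7e9, bytes_per_param=2)
ASSUMED_BANDWIDTH = 1e12  # hypothetical 1 TB/s of memory bandwidth

for ctx in (2048, 4096, 8192, 32768):
    kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=ctx)
    total = weights + kv
    floor_ms = 1e3 * decode_latency_floor_s(total, ASSUMED_BANDWIDTH)
    print(f"ctx={ctx:>6}  KV={kv / GIB:6.2f} GiB  "
          f"total={total / GIB:6.2f} GiB  floor={floor_ms:5.1f} ms/token")
```

With these assumptions the KV-cache costs 0.5 MiB per token, so it grows linearly with context and batch size while the weight term stays fixed. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.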

Module 2: Quantization: shrink the model without breaking quality

3h30

Formats: INT8, INT4, FP8, NF4 — quantizing weights and/or activations
Methods: GPTQ, AWQ, GGUF (llama.cpp), bitsandbytes — differences and use cases
The quality / size / speed trade-off and sensitivity by model architecture
Workshop: quantize a model and measure quality degradation vs VRAM and latency gains
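
To make the size/quality trade-off concrete, here is a minimal sketch of group-wise symmetric round-to-nearest quantization in pure Python. It is deliberately the simplest possible baseline, not GPTQ or AWQ (both of which use calibration data to reduce error beyond this), and the random Gaussian "weights" and group size of 64 are assumptions for illustration.

```python
import random


def quantize_group(values, n_bits):
    """Symmetric absmax round-to-nearest quantization of one group."""
    qmax = 2 ** (n_bits - 1) - 1            # 127 for INT8, 7 for INT4
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def quantize_roundtrip(weights, n_bits, group_size=64):
    """Quantize then dequantize, one scale per group of weights."""
    out = []
    for i in range(0, len(weights), group_size):
        codes, scale = quantize_group(weights[i:i + group_size], n_bits)
        out.extend(dequantize_group(codes, scale))
    return out


def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights
for bits in (8, 4):
    err = rms_error(weights, quantize_roundtrip(weights, bits))
    print(f"INT{bits}: RMS reconstruction error {err:.6f}")
```

Weight reconstruction error is only a proxy. The workshop's measure of quality degradation should come from task metrics or perplexity on the actual model.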

Module 3: High-performance serving: continuous batching and PagedAttention

3h30

Continuous batching vs static batching: why throughput rises sharply under real, mixed-length load
PagedAttention and KV-cache memory management in vLLM
Choosing a framework: vLLM, TGI, SGLang, llama.cpp — strengths and trade-offs
Workshop: deploy with vLLM and benchmark throughput under concurrent load
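
The difference between static and continuous batching can be seen in a toy simulation that counts decode steps. This is a simplified model and not how vLLM is implemented: it assumes every decode step costs the same regardless of how many slots are occupied, ignores prefill, and uses made-up output lengths.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when freed slots are refilled before every step."""
    queue = deque(lengths)
    active = []   # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step emits one token per in-flight request;
        # requests that reach zero remaining tokens leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


# Mixed workload: a few long generations among many short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
total_tokens = sum(lengths)
for name, fn in (("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)):
    steps = fn(lengths, batch_size=4)
    print(f"{name:>10}: {steps} steps, "
          f"{total_tokens / steps:.2f} tokens/step")
```

In the static case, short requests hold their slots idle until the longest request in the batch finishes. Refilling slots after every step removes that idle time, and this is where the throughput gain comes from.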

Module 4: Advanced techniques: speculative decoding, MoE, and long context

3h30

Speculative decoding: a small 'draft' model proposes several tokens, which the target model verifies in a single forward pass
Mixture-of-Experts (MoE): sparse activation and why it scales
FlashAttention: exact attention computed in tiles, without materializing the full attention matrix in GPU memory
Long context: RoPE, YaRN, and the real cost of an extended context window
Workshop: enable speculative decoding and measure the end-to-end latency gain
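
Before measuring, it helps to know what gain to expect from speculative decoding. The sketch below uses a common simplified model that assumes each drafted token is accepted independently with the same probability. The acceptance rates and the draft-to-target cost ratio are assumptions for illustration, and real acceptance rates depend on the model pair and the workload.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Accepted draft tokens plus the one token the target model always
    contributes: 1 + a + a^2 + ... + a^draft_len.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1 - a ** (draft_len + 1)) / (1 - a)


def expected_speedup(acceptance_rate: float, draft_len: int,
                     draft_cost_ratio: float) -> float:
    """Speedup over plain decoding under the simplified model.

    draft_cost_ratio is the cost of one draft-model step relative to
    one target-model pass (an assumed, measured-elsewhere number).
    """
    tokens = expected_tokens_per_pass(acceptance_rate, draft_len)
    cost = draft_len * draft_cost_ratio + 1   # draft steps + one verify pass
    return tokens / cost


for alpha in (0.6, 0.8, 0.9):
    for gamma in (2, 4, 8):
        s = expected_speedup(alpha, gamma, draft_cost_ratio=0.05)
        print(f"acceptance={alpha:.1f} draft_len={gamma}  speedup~{s:.2f}x")
```

The model shows why longer drafts stop paying off: each extra drafted token adds a fixed cost, but it is only useful if every token before it was accepted.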

Ready to get started?

9.99 EUR/month — All courses included, cancel anytime
