Talki Academy

LLM Inference & Serving Optimization

An L300 advanced, hands-on course for engineers serving LLMs in production. Serving a model is expensive and slow until you understand what happens under the hood. This course takes apart the mechanics of inference — why it is memory-bound, how the KV-cache grows, where the VRAM actually goes — then equips you to optimize it: quantization (GPTQ, AWQ, GGUF, FP8), continuous batching and PagedAttention with vLLM, speculative decoding, Mixture-of-Experts, and long-context handling (RoPE/YaRN). You leave with reproducible benchmarks and a method for choosing a serving framework.

Duration: 2 days
Level: Advanced
Price: 9.99 EUR/month (all courses included)
Max group: 12 participants

What you will learn

  • Explain why LLM inference is bound by memory bandwidth (memory-bound), not raw compute
  • Calculate required VRAM (weights + KV-cache) and anticipate the impact of context length
  • Choose and apply a quantization method (GPTQ, AWQ, GGUF, FP8) along the quality/speed trade-off
  • Stand up high-performance serving with continuous batching and PagedAttention (vLLM)
  • Assess the gain from speculative decoding and understand the point of Mixture-of-Experts architectures
  • Benchmark a deployment rigorously: TTFT, tokens/second, throughput under load
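
To make the benchmarking metrics concrete, here is a minimal sketch of how TTFT and decode tokens/second could be derived from per-token arrival timestamps. The function names and the sample timestamps are hypothetical, chosen only for illustration; a real benchmark would record them from a streaming client.

```python
def ttft(request_sent, token_times):
    """Time-to-first-token: delay between sending the request and the first token."""
    return token_times[0] - request_sent

def decode_tps(token_times):
    """Tokens per second during decode, excluding the first token (prefill)."""
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])

def aggregate_throughput(total_tokens, wall_seconds):
    """Tokens per second across all concurrent requests in a load test."""
    return total_tokens / wall_seconds

# Hypothetical timestamps in seconds for one streamed response.
sent = 0.0
arrivals = [0.25, 0.27, 0.29, 0.31, 0.33]
print(f"TTFT: {ttft(sent, arrivals):.3f} s")
print(f"decode speed: {decode_tps(arrivals):.1f} tokens/s")
```

Separating TTFT from decode speed matters because the first is dominated by prefill and queueing, while the second reflects the per-token decode loop.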

Course program

Module 1: The economics of inference: VRAM, attention, and the KV-cache

Duration: 3h30
  • Why inference is memory-bound: memory bandwidth vs raw compute
  • The attention mechanism and the KV-cache: what is stored and why it grows with context
  • VRAM math: model weights + KV-cache (formula and reference table)
  • Measure what matters: TTFT (time-to-first-token), TPS (tokens/second), throughput
  • Workshop: estimate VRAM and latency for a 7B/13B model across context lengths
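
The VRAM math in this module can be sketched as follows. This is a rough estimate under stated assumptions, not the course's reference table: the model shape (32 layers, 32 KV heads, head dimension 128, FP16) is an illustrative 7B-class dense configuration, and the 1 TB/s bandwidth figure is hypothetical. It ignores activations and framework overhead.

```python
GIB = 1024 ** 3

def weights_bytes(n_params, bytes_per_param):
    """Memory for the model weights alone."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_elem=2):
    """KV-cache size: one K and one V vector per layer, per KV head, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size

def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s, bytes_read_per_token):
    """Upper bound on single-stream decode speed when each token must
    stream bytes_read_per_token from VRAM (memory-bound regime)."""
    return bandwidth_bytes_per_s / bytes_read_per_token

# Assumed shape: roughly a 7B-class dense model in FP16 (illustrative values).
N_PARAMS, N_LAYERS, N_KV_HEADS, HEAD_DIM = 7e9, 32, 32, 128
w = weights_bytes(N_PARAMS, 2)
for ctx in (2048, 4096, 16384):
    kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx)
    print(f"context {ctx:>6}: weights {w / GIB:.1f} GiB + KV-cache {kv / GIB:.1f} GiB")

# Hypothetical 1 TB/s of memory bandwidth, counting weight reads only.
print(f"decode ceiling: {decode_ceiling_tokens_per_s(1e12, w):.0f} tokens/s")
```

The KV-cache term grows linearly with both context length and batch size, which is why long contexts and high concurrency compete for the same VRAM. Models that use grouped-query attention have fewer KV heads, which shrinks this term.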

Module 2: Quantization: shrink the model without breaking quality

Duration: 3h30
  • Formats: INT8, INT4, FP8, NF4 — quantizing weights and/or activations
  • Methods and tooling: GPTQ, AWQ, GGUF (the llama.cpp file format), bitsandbytes, with their differences and use cases
  • The quality / size / speed trade-off and sensitivity by model architecture
  • Workshop: quantize a model and measure quality degradation vs VRAM and latency gains
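
As a feel for the quality/size trade-off, the sketch below applies plain symmetric round-to-nearest quantization to one small group of weights and compares the reconstruction error at 8 and 4 bits. This is deliberately the simplest baseline, not GPTQ or AWQ (both of which use calibration data to reduce error further), and the weight values are made up for illustration.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of floats.
    Returns the integer codes and the shared scale for the group."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Made-up weights standing in for one quantization group.
group = [0.12, -0.53, 0.98, -0.07, 0.33, -0.91, 0.45, 0.02]
for bits in (8, 4):
    codes, scale = quantize_group(group, bits)
    err = mean_abs_error(group, dequantize_group(codes, scale))
    print(f"INT{bits}: scale={scale:.4f}, mean abs error={err:.4f}")
```

Halving the bit width halves the storage per weight but makes the rounding step much coarser, which is the degradation the workshop measures against VRAM and latency gains.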

Module 3: High-performance serving: continuous batching and PagedAttention

Duration: 3h30
  • Continuous batching vs static batching: why throughput rises sharply under real load, where output lengths vary from request to request
  • PagedAttention and KV-cache memory management in vLLM
  • Choosing a framework: vLLM, TGI, SGLang, llama.cpp — strengths and trade-offs
  • Workshop: deploy with vLLM and benchmark throughput under concurrent load
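
The difference between static and continuous batching can be illustrated with a toy scheduler simulation. This is a simplified model, not how vLLM is implemented: it counts decode steps only, ignores prefill cost and memory limits, and the output lengths are invented for the example.

```python
def static_batching_steps(output_lens, batch_size):
    """Static batching: each batch runs until its longest request finishes,
    so short requests leave their slots idle."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens, batch_size):
    """Continuous batching: a finished request frees its slot, and a
    waiting request takes it on the next decode step."""
    waiting = list(output_lens)
    running = []                      # remaining tokens per active request
    steps = 0
    while waiting or running:
        # Fill free slots from the queue before each decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1                    # one decode step yields one token per request
        running = [r - 1 for r in running if r > 1]
    return steps

# Invented output lengths (tokens) with high variance, batch of 4 slots.
lens = [10, 200, 15, 20, 180, 12, 25, 30]
total = sum(lens)
s = static_batching_steps(lens, 4)
c = continuous_batching_steps(lens, 4)
print(f"static:     {s} steps, {total / s:.2f} tokens/step")
print(f"continuous: {c} steps, {total / c:.2f} tokens/step")
```

The gain comes entirely from variance in output length: if every request produced the same number of tokens, both schedulers would take the same number of steps.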

Module 4: Advanced techniques: speculative decoding, MoE, and long context

Duration: 3h30
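  • Speculative decoding: a small draft model proposes several tokens, which the target model verifies in a single forward pass
  • Mixture-of-Experts (MoE): sparse activation, where only a few experts run per token, so compute per token grows more slowly than parameter count
  • FlashAttention: exact attention computed without materializing the full attention matrix, which reduces memory traffic and footprint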
  • Long context: RoPE, YaRN, and the real cost of an extended context window
  • Workshop: enable speculative decoding and measure the end-to-end latency gain
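
To reason about the gain from speculative decoding before measuring it, a common back-of-the-envelope model assumes each draft token is accepted independently with probability alpha. Under that assumption (which real workloads only approximate), the sketch below estimates tokens produced per target-model pass and the resulting speedup. The parameter names and sample values are illustrative.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target-model verification pass when the
    draft proposes k tokens, each accepted independently with probability
    alpha. Includes the one token the target model always contributes."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha, k, draft_cost_ratio):
    """Estimated speedup over plain decoding. draft_cost_ratio is the cost
    of one draft step relative to one target step (an assumed input)."""
    cost_per_pass = k * draft_cost_ratio + 1   # k draft steps + 1 target pass
    return expected_tokens_per_pass(alpha, k) / cost_per_pass

# Illustrative values: 80% acceptance, draft step costs 10% of a target step.
for k in (2, 4, 8):
    print(f"k={k}: {expected_tokens_per_pass(0.8, k):.2f} tokens/pass, "
          f"speedup ~{expected_speedup(0.8, k, 0.1):.2f}x")
```

The estimate shows why longer drafts give diminishing returns: the chance that every drafted token is accepted shrinks geometrically, while the draft cost grows linearly with k.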

Ready to get started?

9.99 EUR/month — All courses included, cancel anytime
