Talki Academy

NVFP4 vs FP8 vs GGUF: Which Quantization Format for Production?

Experience report: the 3 serving formats for a quantized LLM (GGUF/llama.cpp, FP8/vLLM, NVFP4/vLLM) compared on portability, throughput, memory, stability. Why we promoted NVFP4 to prod on Blackwell (GB10) on 2026-06-02, and when GGUF or FP8 remain the right choice.

By Talki Academy · Updated on June 4, 2026

To serve a quantized LLM in production, three families of formats keep coming up: GGUF (llama.cpp), FP8 (vLLM, native on Blackwell/Hopper) and NVFP4 (vLLM, 4-bit). This is not a “benchmark of the year”: it is the experience report of our actual choice for an always-on node, and above all the reasoning behind it.

The three formats

GGUF — portability (llama.cpp)

GGUF is llama.cpp’s format. Its strength is that it runs almost everywhere: pure CPU, ROCm (AMD), Vulkan, Metal (Mac), with optional CPU offload. Common quantizations include Q4_K_M, Q5_K_M, Q8… It is simple to deploy and ideal on heterogeneous or consumer hardware. The trade-off: at scale, throughput caps (CPU-side dequantization, less optimal than dedicated GPU kernels). See our local LLM in production guide.

FP8 — the data-center format (vLLM)

FP8 (8-bit) has native hardware support on Hopper and Blackwell. On dedicated-memory GPUs, it offers an excellent quality/throughput trade-off. Its Achilles’ heel for us: on the GB10’s unified memory, it leaked (allocator + attention workspace), to the point that we abandoned it on that platform. That failure is specific to unified memory (UMA), not a condemnation of FP8.
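The report does not say how the leak was detected. One common way to tell a leak from normal warm-up is to sample used memory under steady load and fit a trend line. The sketch below illustrates that idea only: the function names, the sampling format and the 256 MiB/hour threshold are assumptions for illustration, not the authors' tooling.

```python
def memory_growth_per_hour(samples):
    """Least-squares slope of used memory, in bytes per hour.

    samples: list of (seconds_since_start, used_bytes) pairs, taken under
    steady load by whatever monitoring you already have (assumption).
    """
    times = [t for t, _ in samples]
    used = [b for _, b in samples]
    n = len(samples)
    mean_t = sum(times) / n
    mean_b = sum(used) / n
    num = sum((t - mean_t) * (b - mean_b) for t, b in zip(times, used))
    den = sum((t - mean_t) ** 2 for t in times)
    return (num / den) * 3600


def looks_like_leak(samples, threshold_bytes_per_hour=256 * 1024**2):
    """True when memory keeps growing faster than the (hypothetical) threshold."""
    return memory_growth_per_hour(samples) > threshold_bytes_per_hour
```

A flat or oscillating series stays under the threshold, while a series that climbs by a fixed amount every minute is flagged. The threshold has to be tuned to the node: a cache that is still filling also grows for a while before it plateaus.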

NVFP4 — the 4-bit that won (vLLM + Marlin)

NVFP4 is NVIDIA’s 4-bit format, served by vLLM with Marlin kernels. On Blackwell, it combines a small memory footprint with high throughput. It is the format we promoted to production on 2 June 2026, with speculative decoding via multi-token prediction (MTP) and an FP8 KV cache.
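To make "4-bit" concrete, here is a simplified sketch of block-scaled FP4 quantization in the spirit of NVFP4: weights are grouped in blocks of 16, each block shares one scale, and each weight is rounded to one of the eight magnitudes a 4-bit E2M1 value can represent. This is an illustration under simplifying assumptions: the real format stores the block scale in FP8 and adds a per-tensor scale, both of which are skipped here, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (a sign bit is stored separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # weights sharing one scale


def quantize_block(block):
    """Return (scale, codes); codes are signed E2M1 values.

    Simplification: the scale is kept as a Python float instead of FP8.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / 6.0  # the largest weight in the block maps to 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        codes.append(mag if x >= 0 else -mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the shared scale and the codes."""
    return [scale * c for c in codes]
```

The grid is denser near zero and coarser near the block maximum, which is why a small shared block keeps the rounding error manageable: one outlier only degrades the 15 weights that share its scale.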

Comparison table (our Blackwell context)

| Criterion | GGUF | FP8 | NVFP4 |
|---|---|---|---|
| Hardware portability | ✅ Very broad | ➖ Hopper/Blackwell | ➖ Blackwell |
| Throughput at scale | ➖ Caps | ✅ High (dedicated mem) | ✅ Best (Blackwell) |
| Memory footprint | ✅ Low | ➖ Heavier | ✅ Low |
| Stability on unified memory | ✅ OK | ❌ Leak (for us) | ✅ Stable |
| Deployment ease | ✅ Simple | ➖ vLLM | ➖ vLLM |
| Sweet spot | Consumer / heterogeneous | Data-center dedicated mem | Blackwell at scale |
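The "Memory footprint" row comes down to bits per weight. The sketch below is a back-of-the-envelope estimate for a hypothetical 30-billion-parameter model. The model size is an assumption, the Q4_K_M figure is approximate and varies by model, and the estimate ignores layers kept in higher precision, the KV cache and runtime buffers.

```python
# Approximate storage cost per weight, in bits.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16.0,  # 4-bit value + one 8-bit scale per 16 weights
    "GGUF Q8_0": 8.5,           # 32 int8 values + one fp16 scale per block
    "GGUF Q4_K_M": 4.85,        # approximate; K-quants mix types per tensor
}


def weight_gigabytes(n_params, fmt):
    """Weights-only footprint in GB (10**9 bytes) for a given format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


HYPOTHETICAL_PARAMS = 30e9  # illustrative model size, not the authors' model

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:12s} {weight_gigabytes(HYPOTHETICAL_PARAMS, fmt):6.1f} GB")
```

Under these assumptions the 4-bit formats land near 17 to 18 GB against 30 GB for FP8 and 60 GB for BF16, which is the gap the table summarizes as "Low" versus "Heavier".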

How we chose

Our context: an always-on GB10 node (Blackwell, unified memory) under sustained volume. We ran three attempts in parallel:

  • GGUF: worked, but throughput capped for our load.
  • FP8: leaked on unified memory — abandoned after five attempts.
  • NVFP4: held the memory budget, stayed stable, and delivered the best throughput.

Chosen production config (NVFP4, vLLM):

  • quantization: NVFP4 (4-bit, Marlin kernels)
  • speculative decoding: MTP, n=3
  • KV cache: FP8
  • max context length: 262144
  • rollback ready: BF16 path disabled, re-enableable

Promoted to production: 2026-06-02
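A 262144-token context makes the KV cache a large share of the memory budget, which is one reason to store it in FP8. The arithmetic below is a rough sketch for a standard full-attention transformer. The layer count, KV head count and head dimension are hypothetical (the article does not name the model), and the estimate ignores scaling metadata and paging overhead.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """KV cache size for one sequence of `tokens` tokens.

    The factor 2 accounts for one key and one value vector
    per layer, per KV head, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only to make the arithmetic concrete.
HYPOTHETICAL_MODEL = dict(layers=48, kv_heads=8, head_dim=128)
MAX_CONTEXT = 262144  # from the production config above

fp8_gib = kv_cache_bytes(MAX_CONTEXT, bytes_per_value=1, **HYPOTHETICAL_MODEL) / 1024**3
bf16_gib = kv_cache_bytes(MAX_CONTEXT, bytes_per_value=2, **HYPOTHETICAL_MODEL) / 1024**3
print(f"FP8 KV cache:  {fp8_gib:.0f} GiB per full-length sequence")
print(f"BF16 KV cache: {bf16_gib:.0f} GiB per full-length sequence")
```

With these assumed dimensions, one full-length sequence needs 24 GiB in FP8 against 48 GiB in BF16. Models that use sliding-window or other reduced-cache attention need less, so treat this as an upper-bound style estimate.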

There is no “best format”

Our NVFP4 verdict holds for our Blackwell hardware at scale. Elsewhere, the answer changes:

  • Consumer or heterogeneous hardware (laptops, older cards, Mac) → GGUF stays the pragmatic default.
  • Data-center dedicated memory (Hopper/Blackwell) → FP8 shines.
  • Blackwell at scale, sustained volume → NVFP4.
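The three rules above can be written down as a small decision function. This is only a sketch of the article's reasoning: the `Node` fields and the fallback to GGUF for cases the article does not cover (for example Blackwell with unified memory but light load) are assumptions, and the result is a starting point to benchmark, not a verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    gpu_arch: str           # e.g. "blackwell", "hopper", "apple", "cpu"
    unified_memory: bool    # True for UMA platforms such as the GB10
    sustained_volume: bool  # always-on node with steady, high load


def suggest_format(node):
    """Map the article's three rules to a first candidate format."""
    arch = node.gpu_arch.lower()
    if arch == "blackwell" and node.sustained_volume:
        return "NVFP4"
    if arch in {"hopper", "blackwell"} and not node.unified_memory:
        return "FP8"
    # Consumer, heterogeneous, or anything not covered above (assumption).
    return "GGUF"


gb10 = Node(gpu_arch="blackwell", unified_memory=True, sustained_volume=True)
print(suggest_format(gb10))
```

The rule order matters: sustained volume on Blackwell is checked first, so a GB10-style node never reaches the FP8 branch that failed on unified memory.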

The right format is a function of your hardware, your scale and your ops skill, not a universal truth. And it is validated on a frozen benchmark, with your model and your real load — see our LLM benchmark. That is what separates a defensible infrastructure decision from a format hunch.
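A "frozen" benchmark means the prompt set cannot drift between runs, so results for GGUF, FP8 and NVFP4 stay comparable. A minimal sketch of that idea follows, assuming a `generate` callable that you supply and that returns the number of output tokens for a prompt. All names here are hypothetical and it is not the benchmark referenced above.

```python
import hashlib
import json
import time


def freeze(prompts):
    """Stable fingerprint of the prompt set, recorded once and checked on every run."""
    payload = json.dumps(prompts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_benchmark(prompts, expected_fingerprint, generate, clock=time.perf_counter):
    """Run every prompt through `generate` and report output tokens per second.

    generate(prompt) -> int is an assumed interface wrapping your serving stack.
    """
    if freeze(prompts) != expected_fingerprint:
        raise ValueError("prompt set changed; results are not comparable")
    start = clock()
    total_tokens = sum(generate(p) for p in prompts)
    elapsed = clock() - start
    rate = total_tokens / elapsed if elapsed > 0 else float("inf")
    return {"tokens": total_tokens, "seconds": elapsed, "tokens_per_second": rate}
```

Store the fingerprint alongside the results. A run whose fingerprint differs is a different benchmark, whatever the throughput number says. This sequential loop measures single-stream throughput only; a real load test would also replay your concurrency pattern.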
