NVFP4, FP8, GGUF: what’s the difference in one sentence?

GGUF is llama.cpp’s format: highly portable (CPU, ROCm, Vulkan, Metal), ideal for heterogeneous or consumer hardware. FP8 is an 8-bit format served by vLLM, with native hardware support on Hopper and Blackwell GPUs with dedicated memory. NVFP4 is NVIDIA’s 4-bit format, served by vLLM with Marlin kernels: higher throughput and a smaller memory footprint on Blackwell. None is “best” in absolute terms: the right choice depends on your hardware, your scale and your ops skill.

Why did you choose NVFP4 in production?

On our always-on node (GB10, Blackwell, unified memory), at sustained volume, NVFP4 won on three criteria at once: it fits the memory budget, stays stable over time, and delivers the best throughput (further improved by multi-token speculative decoding). FP8 leaked on that unified memory; GGUF worked but capped on throughput. NVFP4 was promoted on 2 June 2026.

Is GGUF obsolete, then?

Not at all. GGUF remains the most pragmatic default on heterogeneous or consumer hardware: it runs where vLLM doesn’t (pure CPU, older cards, Metal on Mac), and it is simple to deploy. Our NVFP4 verdict applies to our Blackwell hardware at scale, not to a laptop or a two-generations-old RTX, where GGUF is often the right tool.

When is FP8 the right choice?

On dedicated-memory Blackwell or Hopper (data-center), FP8 offers an excellent quality/throughput trade-off with native hardware support. Our FP8 failure was specific to the GB10’s unified memory (a leak via the allocator and the attention workspace), not a condemnation of FP8 in general.

How do you decide without getting it wrong?

Measure on a frozen benchmark, with your model, your hardware and your real load. Spec sheets and generic benchmarks do not predict behaviour on your memory, your versions and your request profile. A sound infrastructure decision comes from a reproducible measurement, not a format hunch.

NVFP4 vs FP8 vs GGUF: Which Format to Serve an LLM in Prod | Talki Academy

To serve a quantized LLM in production, three families of formats keep coming up: GGUF (llama.cpp), FP8 (vLLM, native on Blackwell/Hopper) and NVFP4 (vLLM, 4-bit). This is not a “benchmark of the year”: it is the experience report of our actual choice for an always-on node, and above all the reasoning behind it.

The three formats

GGUF — portability (llama.cpp)

GGUF is llama.cpp’s format. Its strength is that it runs almost everywhere: pure CPU, ROCm (AMD), Vulkan, Metal (Mac), with optional CPU offload. It comes in several quantization levels (Q4_K_M, Q5_K_M, Q8…), is simple to deploy, and is ideal on heterogeneous or consumer hardware. Trade-off: at scale, throughput caps, because CPU-side dequantization is less efficient than dedicated GPU kernels. See our local LLM in production guide.

FP8 — the data-center format (vLLM)

FP8 (8-bit) has native hardware support on Hopper and Blackwell. On dedicated-memory GPUs, it offers an excellent quality/throughput trade-off. Its Achilles’ heel for us: on the GB10’s unified memory (UMA), it leaked through the allocator and the attention workspace, to the point that we abandoned it on that platform. That failure is specific to unified memory, not a condemnation of FP8.
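A leak of this kind usually shows up as memory that keeps growing under a constant load. Here is a minimal sketch of how one might flag it from periodic samples, assuming you already collect (elapsed seconds, used MiB) pairs from your monitoring. The function names and the 1 MiB/min threshold are illustrative assumptions, not our actual tooling.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (elapsed_seconds, used_mib)


def memory_slope_mib_per_min(samples: Sequence[Sample]) -> float:
    """Least-squares slope of used memory over time, in MiB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    # cov / var_t is MiB per second; convert to MiB per minute.
    return cov / var_t * 60.0


def looks_like_leak(samples: Sequence[Sample],
                    threshold_mib_per_min: float = 1.0) -> bool:
    """True when memory grows faster than the (hypothetical) threshold."""
    return memory_slope_mib_per_min(samples) > threshold_mib_per_min
```

Run it over a soak test long enough for warm-up effects (cache fill, graph capture) to settle; a positive slope during the first minutes is not necessarily a leak.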

NVFP4 — the 4-bit that won (vLLM + Marlin)

NVFP4 is NVIDIA’s 4-bit format, served by vLLM with Marlin kernels. On Blackwell, it combines a small memory footprint with high throughput. It is the format we promoted to production on 2 June 2026, with multi-token speculative decoding (MTP) and an FP8 KV cache.
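To get a feel for why speculative decoding with a few draft tokens can raise throughput, here is a rough sketch of the standard expected-tokens-per-step estimate. It assumes every draft token is accepted independently with the same probability, which real workloads only approximate; the acceptance rate is a hypothetical input, not a figure we measured.

```python
def expected_tokens_per_step(acceptance_rate: float,
                             num_draft_tokens: int) -> float:
    """Expected tokens emitted per verification step.

    With acceptance rate a and k draft tokens, the expectation is
    1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be >= 0")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the token from the target model.
        return float(num_draft_tokens + 1)
    k = num_draft_tokens
    return (1.0 - acceptance_rate ** (k + 1)) / (1.0 - acceptance_rate)
```

For example, expected_tokens_per_step(0.5, 3) gives 1.875 tokens per step. This is an upper bound on the speed-up, since it ignores the cost of producing and verifying the drafts.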

Comparison table (our Blackwell context)

Criterion	GGUF	FP8	NVFP4
Hardware portability	✅ Very broad	➖ Hopper/Blackwell	➖ Blackwell
Throughput at scale	➖ Caps	✅ High (dedicated mem)	✅ Best (Blackwell)
Memory footprint	✅ Low	➖ Heavier	✅ Low
Stability on unified memory	✅ OK	❌ Leak (for us)	✅ Stable
Deployment ease	✅ Simple	➖ vLLM	➖ vLLM
Sweet spot	Consumer / heterogeneous	Data-center dedicated mem	Blackwell at scale
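The memory footprint row can be sanity-checked with back-of-the-envelope arithmetic on the weights alone. The sketch below assumes nominal bit widths: 16 for BF16, 8 for FP8, and 4.5 for NVFP4 (4-bit values plus one 8-bit scale per block of 16 weights). GGUF is left out because its bits per weight depend on the quantization level you pick. KV cache, activations and runtime overhead are not counted, and the parameter count is whatever you pass in.

```python
GIB = 1024 ** 3

# Nominal bits per weight (assumption: per-tensor scales are negligible).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16.0,  # 4-bit value + 8-bit scale shared by 16 weights
}


def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    return num_params * bits_per_weight / 8.0 / GIB


def footprint_table(num_params: float) -> dict:
    """Weights-only footprint per format for a given parameter count."""
    return {name: weight_footprint_gib(num_params, bits)
            for name, bits in BITS_PER_WEIGHT.items()}
```

Under these assumptions NVFP4 weights take roughly 28% of the BF16 size and a bit more than half of the FP8 size, whatever the model.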

How we chose

Our context: an always-on GB10 node (Blackwell, unified memory) under sustained volume. We tried three formats in parallel:

GGUF: worked, but throughput capped for our load.
FP8: leaked on unified memory — abandoned after five attempts.
NVFP4: held the memory budget, stayed stable, and delivered the best throughput.

Chosen production config (NVFP4, vLLM):
  - quantization: NVFP4 (4-bit, Marlin kernels)
  - speculative decoding: MTP, n=3
  - KV cache: FP8
  - max context length: 262144
  - rollback ready: BF16 path disabled, can be re-enabled

Promoted to production: 2026-06-02
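As an illustration only, the settings above could be turned into a vLLM launch command along these lines. This is a sketch, not our deployment script: the model path and the speculative method name are placeholders you must supply, and no quantization flag is set, on the assumption that the NVFP4 checkpoint declares its own quantization. Check every flag against the vLLM version you run.

```python
import json
from typing import List


def build_vllm_serve_args(model: str,
                          speculative_method: str,
                          num_speculative_tokens: int = 3,
                          kv_cache_dtype: str = "fp8",
                          max_model_len: int = 262144) -> List[str]:
    """Build an argv list for `vllm serve` from the config values.

    `model` and `speculative_method` are placeholders: the right values
    depend on your checkpoint and your vLLM version.
    """
    speculative_config = {
        "method": speculative_method,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--kv-cache-dtype", kv_cache_dtype,
        "--speculative-config", json.dumps(speculative_config),
    ]
```

Building the argv as a list rather than a shell string keeps the JSON argument intact when it is passed to subprocess.run, and makes the config easy to diff between the production path and the rollback path.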

There is no “best format”

Our NVFP4 verdict holds for our Blackwell hardware at scale. Elsewhere, the answer changes:

Consumer or heterogeneous hardware (laptops, older cards, Mac) → GGUF stays the pragmatic default.
Data-center dedicated memory (Hopper/Blackwell) → FP8 shines.
Blackwell at scale, sustained volume → NVFP4.
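The three rules above can be written down as a small helper that picks which format to benchmark first. It is a sketch of our rule of thumb, not a verdict: the Node fields are hypothetical, and the precedence (Blackwell under sustained volume is checked before dedicated memory) is our assumption for cases where two rules overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    gpu_arch: str            # e.g. "blackwell", "hopper", "other", "none"
    dedicated_memory: bool   # False for unified memory or no GPU
    sustained_volume: bool   # always-on, high request volume


def first_format_to_benchmark(node: Node) -> str:
    """Return the format to try first; the benchmark has the final word."""
    arch = node.gpu_arch.lower()
    if arch == "blackwell" and node.sustained_volume:
        return "NVFP4"
    if arch in {"hopper", "blackwell"} and node.dedicated_memory:
        return "FP8"
    # Laptops, older cards, Macs, CPU-only machines.
    return "GGUF"
```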

The right format is a function of your hardware, your scale and your ops skill, not a universal truth. And it is validated on a frozen benchmark, with your model and your real load — see our LLM benchmark. That is what separates a defensible infrastructure decision from a format hunch.
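A frozen benchmark can start very small. The sketch below assumes you wrap your serving stack in a generate(prompt) callable that returns the number of output tokens it produced; that callable, the prompt list and the metric names are all hypothetical. It sends requests one at a time, so it measures single-stream behaviour; a real load test would also need concurrency that matches your traffic.

```python
import time
from statistics import median
from typing import Callable, Dict, Sequence


def run_frozen_benchmark(generate: Callable[[str], int],
                         prompts: Sequence[str],
                         clock: Callable[[], float] = time.perf_counter
                         ) -> Dict[str, float]:
    """Replay a fixed prompt set and report throughput and latency.

    `generate` must return the number of output tokens for one prompt.
    """
    if not prompts:
        raise ValueError("prompts must not be empty")
    latencies = []
    total_tokens = 0
    start = clock()
    for prompt in prompts:
        t0 = clock()
        total_tokens += generate(prompt)
        latencies.append(clock() - t0)
    elapsed = clock() - start
    return {
        "requests": len(prompts),
        "output_tokens": total_tokens,
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else float("inf"),
        "median_latency_s": median(latencies),
    }
```

Keep the prompt set, the sampling parameters and the software versions fixed between runs, and change only the format under test; otherwise the numbers are not comparable.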
