When you serve a quantized LLM in production, three format families keep coming up: GGUF (llama.cpp), FP8 (vLLM, with native hardware support on Hopper and Blackwell) and NVFP4 (vLLM, 4-bit). This is not a “benchmark of the year”: it is an experience report on the choice we actually made for an always-on node and, above all, on the reasoning behind it.
The three formats
GGUF — portability (llama.cpp)
GGUF is llama.cpp’s format. Its strength is that it runs almost everywhere: pure CPU, ROCm (AMD), Vulkan, Metal (Mac), with optional CPU offload. It comes in several quantization levels (Q4_K_M, Q5_K_M, Q8…), is simple to deploy, and is ideal on heterogeneous or consumer hardware. The trade-off: at scale, throughput caps out, because CPU-side dequantization is less efficient than dedicated GPU kernels. See our local LLM in production guide.
FP8 — the data-center format (vLLM)
FP8 (8-bit floating point) has native hardware support on Hopper and Blackwell. On GPUs with dedicated memory, it offers an excellent quality/throughput trade-off. Its Achilles’ heel for us: on the GB10’s unified memory, it leaked memory (allocator plus attention workspace), to the point that we abandoned it on that platform. That is a failure specific to unified memory (UMA), not a condemnation of FP8.
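One way to tell a real leak from normal warm-up growth is to sample memory use under steady load and fit a line through the samples. The sketch below is a generic illustration, not the procedure we used on the GB10; the function names, the threshold and the sample readings are all hypothetical.

```python
from statistics import fmean


def leak_slope_mib_per_min(samples):
    """Least-squares slope of memory use over time.

    samples: list of (minutes_since_start, used_mib) pairs taken at steady load.
    Returns the growth rate in MiB per minute.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_mean = fmean(t for t, _ in samples)
    m_mean = fmean(m for _, m in samples)
    num = sum((t - t_mean) * (m - m_mean) for t, m in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


def looks_like_leak(samples, threshold_mib_per_min=1.0):
    # Hypothetical threshold: tune it to the noise floor of your own node.
    return leak_slope_mib_per_min(samples) > threshold_mib_per_min


# Made-up readings for illustration only (not measurements from our node).
stable = [(0, 50_000), (10, 50_020), (20, 49_990), (30, 50_010)]
leaking = [(0, 50_000), (10, 50_400), (20, 50_800), (30, 51_200)]

print(looks_like_leak(stable), looks_like_leak(leaking))
```

A sustained positive slope at constant load is the signal worth acting on; a one-off jump after startup usually is not.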
NVFP4 — the 4-bit that won (vLLM + Marlin)
NVFP4 is NVIDIA’s 4-bit floating-point format, served by vLLM with Marlin kernels. On Blackwell, it combines a small memory footprint with high throughput. It is the format we promoted to production on 2 June 2026, with speculative decoding based on multi-token prediction (MTP) and an FP8 KV cache.
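To make that combination concrete, here is a minimal sketch that assembles a `vllm serve` command line with an FP8 KV cache and a speculative-decoding config. It is not our production command. The checkpoint path, the `"mtp"` method name, the token count and the memory fraction are placeholders, and the valid speculative method names depend on your vLLM version and model. We also assume vLLM reads the quantization scheme from the checkpoint's own config, so no quantization flag is set; verify that for your checkpoint.

```python
import json
import shlex


def vllm_serve_argv(model, spec_method, num_spec_tokens=2, gpu_mem_util=0.85):
    """Build an argv list for `vllm serve`.

    model:       path or repo id of an NVFP4 checkpoint (placeholder).
    spec_method: speculative method name (assumption: check your vLLM docs).
    """
    spec_config = {
        "method": spec_method,
        "num_speculative_tokens": num_spec_tokens,
    }
    return [
        "vllm", "serve", model,
        "--kv-cache-dtype", "fp8",
        "--speculative-config", json.dumps(spec_config),
        "--gpu-memory-utilization", str(gpu_mem_util),
    ]


# Hypothetical values, for illustration only.
argv = vllm_serve_argv("path/to/nvfp4-checkpoint", spec_method="mtp")
print(shlex.join(argv))
```

Building the command as a list rather than a string keeps the JSON argument correctly quoted when it is passed to `subprocess.run`.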
Comparison table (our Blackwell context)
| Criterion | GGUF | FP8 | NVFP4 |
|---|---|---|---|
| Hardware portability | ✅ Very broad | ➖ Hopper/Blackwell | ➖ Blackwell |
| Throughput at scale | ➖ Caps | ✅ High (dedicated mem) | ✅ Best (Blackwell) |
| Memory footprint | ✅ Low | ➖ Heavier | ✅ Low |
| Stability on unified memory | ✅ OK | ❌ Leak (for us) | ✅ Stable |
| Deployment ease | ✅ Simple | ➖ vLLM | ➖ vLLM |
| Sweet spot | Consumer / heterogeneous | Data-center dedicated mem | Blackwell at scale |
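The “Memory footprint” row comes down to simple arithmetic: parameters times bits per weight. The sketch below shows it for a hypothetical 30B-parameter model. The bits-per-weight figures are approximations that include per-block scale overhead, not measurements from our node, and the Q4_K_M value in particular is a rough average that varies by model.

```python
# Approximate bits per weight, including per-block scale overhead.
# These are assumptions for illustration, not measurements from our node.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.5,         # 4-bit values + one 8-bit scale per 16-value block
    "GGUF Q8_0": 8.5,     # 8-bit values + one 16-bit scale per 32-value block
    "GGUF Q4_K_M": 4.85,  # rough average of a mixed-precision scheme
}


def weight_gib(n_params, bits_per_weight):
    """Size of the weights alone, in GiB (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8 / 2**30


n_params = 30e9  # hypothetical 30B-parameter model
for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:12s} {weight_gib(n_params, bpw):6.1f} GiB")
```

This counts weights only; the KV cache and runtime workspaces come on top, which is exactly where our FP8 problems on unified memory showed up.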
How we chose
Our context: a GB10 node (Blackwell, unified memory), always on, under sustained volume. We tried all three formats in parallel:
- GGUF: worked, but throughput capped out under our load.
- FP8: leaked memory on unified memory; abandoned after five attempts.
- NVFP4: held the memory budget, stayed stable, and delivered the best throughput.
There is no “best format”
Our NVFP4 verdict holds for our Blackwell hardware at scale. Elsewhere, the answer changes:
- Consumer or heterogeneous hardware (laptops, older cards, Mac) → GGUF stays the pragmatic default.
- Data-center dedicated memory (Hopper/Blackwell) → FP8 shines.
- Blackwell at scale, sustained volume → NVFP4.
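These rules of thumb can be written down as a small function. The sketch below is a deliberate simplification: the `Node` fields and the `suggest_format` name are hypothetical, and the result is only the first candidate to benchmark, not a verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    gpu_arch: str           # e.g. "blackwell", "hopper", "other"
    unified_memory: bool    # True for UMA platforms such as the GB10
    sustained_volume: bool  # True for an always-on, high-volume service


def suggest_format(node: Node) -> str:
    """Return the first format to try, following the rules of thumb above."""
    arch = node.gpu_arch.lower()
    if arch == "blackwell" and node.sustained_volume:
        return "NVFP4"
    if arch in ("hopper", "blackwell") and not node.unified_memory:
        return "FP8"
    # Everything else: laptops, older cards, Macs, mixed fleets.
    return "GGUF"


print(suggest_format(Node("blackwell", unified_memory=True, sustained_volume=True)))
print(suggest_format(Node("hopper", unified_memory=False, sustained_volume=True)))
print(suggest_format(Node("other", unified_memory=True, sustained_volume=False)))
```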
The right format is a function of your hardware, your scale and your ops skill, not a universal truth. It should be validated on a frozen benchmark, with your model and your real load (see our LLM benchmark). That is what separates a defensible infrastructure decision from a format hunch.
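As an illustration of what “frozen” can mean in practice, the sketch below fingerprints the prompt set so that every run can show it used the same inputs, then reports output tokens per second. It is not our benchmark harness: `fake_generate` is a stand-in backend, and the prompts and function names are hypothetical.

```python
import hashlib
import time


def prompt_set_fingerprint(prompts):
    """Hash the prompt set so each run can prove it used the same inputs."""
    h = hashlib.sha256()
    for p in prompts:
        h.update(p.encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return h.hexdigest()


def run_benchmark(generate, prompts, max_tokens=128):
    """Time `generate` over a fixed prompt list and report output tokens/s.

    generate(prompt, max_tokens) must return the number of tokens produced.
    """
    start = time.perf_counter()
    total_tokens = sum(generate(p, max_tokens) for p in prompts)
    elapsed = time.perf_counter() - start
    return {
        "fingerprint": prompt_set_fingerprint(prompts),
        "total_tokens": total_tokens,
        "tokens_per_s": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


def fake_generate(prompt, max_tokens):
    # Stand-in backend for the sketch; replace with a call to your server.
    return min(max_tokens, len(prompt.split()) * 4)


PROMPTS = ["Summarize this ticket.", "Translate the following sentence into German."]
result = run_benchmark(fake_generate, PROMPTS)
print(result["fingerprint"][:12], result["total_tokens"])
```

Comparing two formats is only meaningful when the fingerprints match; a changed prompt set means a new baseline, not a new result.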