
FP8 on GB10: Why We Abandoned It (and Chose NVFP4)

A negative-result write-up: 5 FP8 attempts on the DGX Spark's GB10, all ending in a memory leak (CUDA allocator + FlashInfer on unified memory). Why we switched to NVFP4, promoted to production on 2026-06-02.

By Talki Academy · Updated on June 4, 2026

Here is a write-up you rarely see: a negative result. We wanted to run our production model in FP8 on the DGX Spark’s GB10 (Grace-Blackwell, unified memory). After five attempts, we abandoned FP8 on this platform and switched to NVFP4. Here is why — and what it taught us.

What we wanted, and why

FP8 is the “obvious” format on Blackwell: native hardware support, a good quality/throughput trade-off, wide data-center adoption. On an always-on node, the goal was simple: better throughput than GGUF, near-intact quality, and a memory footprint compatible with the GB10’s unified pool.

The Spark’s quirk is precisely that unified memory (UMA): CPU and GPU share a single pool. That is its strength (no host↔device copy) and, as we will see, its trap.
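Because the pool is shared, its headroom can be watched from the host side. The following is a minimal sketch, not our production tooling. It assumes a Linux host where GPU allocations on UMA draw down the same `MemAvailable` figure that `/proc/meminfo` reports (verify this on your own node), and `parse_meminfo`, `pool_headroom_gib`, and `read_pool_headroom_gib` are hypothetical helper names.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def pool_headroom_gib(fields: dict[str, int]) -> float:
    """Convert MemAvailable (kB) to GiB."""
    return fields["MemAvailable"] / (1024 ** 2)


def read_pool_headroom_gib(path: str = "/proc/meminfo") -> float:
    """Read the live headroom; assumes a Linux host (hypothetical helper)."""
    return pool_headroom_gib(parse_meminfo(Path(path).read_text()))
```

Sampling `read_pool_headroom_gib()` between requests gives a single number for the whole CPU+GPU pool, which is the quantity that matters on this platform.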

The failure mode: a leak via unified memory

On every attempt, the same scenario: the server started, answered correctly… then its memory usage climbed request after request until it saturated the shared pool, followed by a crash. Not a clean load-time error — a progressive leak.

The cause sat at the intersection of three things:

  • The CUDA allocator on unified memory: blocks were not actually returned to the pool.
  • FlashInfer’s workspace (the attention buffers) growing without being recycled correctly in this UMA context.
  • No guardrail: cgroup does not bound a runaway on UMA, and the Linux scheduler does not distinguish CPU RAM from VRAM in that single pool — so nothing capped the drift before the crash.

Observed symptom (schematic):

    t0     : model loaded, ~23 GB used, answers OK
    t+Nq   : ~+X GB every N requests, never returned
    ...    : saturation of the unified CPU+GPU pool
             -> OOM / worker crash, no recovery

    cgroup          : does not bound a runaway on unified memory
    Linux scheduler : CPU RAM and VRAM = same pool, indistinguishable
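
The "grows every N requests, never returned" pattern can be turned into a simple check over memory samples. This is a sketch under stated assumptions, not the script we ran: the function names are hypothetical, and the thresholds (`min_gb_per_request`, `tolerance_gb`) are illustrative defaults to tune for your own load.

```python
def growth_per_request(samples: list[tuple[int, float]]) -> float:
    """Least-squares slope of used memory (GB) against requests served.

    Each sample is (requests_served_so_far, used_memory_gb).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span different request counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def looks_like_leak(
    samples: list[tuple[int, float]],
    min_gb_per_request: float = 0.001,  # illustrative threshold
    tolerance_gb: float = 0.05,         # illustrative noise allowance
) -> bool:
    """True if memory trends upward and never drops back by more than the tolerance."""
    slope = growth_per_request(samples)
    never_returned = all(
        later[1] >= earlier[1] - tolerance_gb
        for earlier, later in zip(samples, samples[1:])
    )
    return slope >= min_gb_per_request and never_returned
```

A stable server produces a slope near zero once its caches are warm. A progressive leak shows a positive slope that persists however long the soak test runs.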

Five attempts, same wall

We varied everything we could: stack versions (vLLM / FlashInfer), KV-cache size, explicit memory limits, allocator options. All five configurations converged on the same leak. From there, the math is simple: continuing meant debugging an immature upstream path (FP8/FlashInfer on Blackwell UMA) instead of shipping. Decision: close FP8 on this platform.

The pivot that worked: NVFP4

We moved to NVFP4 quantization (4-bit, compressed-tensors format, Marlin kernels). Result: it fits the GB10 memory budget, stays stable over time (no leak), and delivers higher throughput. It was promoted to production on 2 June 2026 as the always-on format.
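
The footprint gap can be estimated with back-of-the-envelope arithmetic. The sketch below assumes the published NVFP4 layout (4-bit elements with one 8-bit scale per 16-element block) and counts weights only: per-tensor scales, unquantized layers, KV cache, and activations are ignored. The parameter count in the usage note is hypothetical, not our model's.

```python
def bits_per_weight(element_bits: int, block_size: int = 0, scale_bits: int = 0) -> float:
    """Effective bits per weight, including an optional per-block scale."""
    if block_size == 0:
        return float(element_bits)
    return element_bits + scale_bits / block_size


FP8_BITS = bits_per_weight(8)                                  # 8.0
NVFP4_BITS = bits_per_weight(4, block_size=16, scale_bits=8)   # 4 + 8/16 = 4.5


def weight_footprint_gb(n_params: float, bits: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes) for n_params weights."""
    return n_params * bits / 8 / 1e9
```

For a hypothetical 10-billion-parameter model, `weight_footprint_gb(10e9, FP8_BITS)` is 10 GB and `weight_footprint_gb(10e9, NVFP4_BITS)` is about 5.6 GB, roughly 56% of the FP8 figure. On a shared pool, that difference is returned directly to the KV cache and to the host.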

The counter-intuitive part: well-implemented 4-bit quantization beat FP8 on our hardware. Not because FP8 is “worse” in the absolute, but because the 4-bit path was mature and stable on this platform, where the FP8 path was not. See also our LLM benchmark and our local LLM in production guide.

FP8 vs NVFP4 on GB10 (our stack)

| Criterion | FP8 | NVFP4 |
| --- | --- | --- |
| Stability over time (UMA) | ❌ Progressive leak | ✅ Stable |
| Memory footprint | ➖ Heavier | ✅ Fits the pool |
| Throughput | ➖ Unmeasurable (crash) | ✅ Higher |
| Path maturity (our stack) | ❌ Immature on UMA | ✅ Mature |
| Production verdict | Abandoned after 5 tries | Promoted 2026-06-02 |

Three lessons

  • Unified memory changes the rules. A format that “works everywhere” else can leak on UMA. Test the format on your platform, not on the spec sheet.
  • A negative result has value. Five documented attempts beat a sixth blind one. Knowing when to stop is an engineering skill.
  • Validate on a frozen benchmark. Our NVFP4 verdict comes from a reproducible measurement, not an impression. That is what makes an infrastructure decision defensible.
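
One way to keep a benchmark "frozen" is to fingerprint its inputs and store that fingerprint next to every result. This is a minimal sketch of the idea, not a description of our harness: `benchmark_fingerprint` is a hypothetical helper, and the choice of fields (prompts plus sampling parameters) is an assumption.

```python
import hashlib
import json


def benchmark_fingerprint(prompts: list[str], params: dict) -> str:
    """SHA-256 over a canonical JSON encoding of prompts and sampling params.

    Keys are sorted so that dict ordering does not change the hash;
    prompt order is preserved because it is part of the benchmark.
    """
    payload = json.dumps(
        {"prompts": prompts, "params": params},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs are comparable only if their fingerprints match. Any edit to a prompt or a sampling parameter yields a different hash, so a changed benchmark cannot pass silently for the old one.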

This experience is ours: our hardware, our versions, our load. It does not condemn FP8 in general — it documents precisely where and why it failed for us, and the format that replaced it.
