Here is a write-up you rarely see: a negative result. We wanted to run our production model in FP8 on the DGX Spark’s GB10 (Grace-Blackwell, unified memory). After five attempts, we abandoned FP8 on this platform and switched to NVFP4. Here is why — and what it taught us.
What we wanted, and why
FP8 is the “obvious” format on Blackwell: native hardware support, a good quality/throughput trade-off, wide data-center adoption. On an always-on node, the goal was simple: better throughput than GGUF, near-intact quality, and a memory footprint compatible with the GB10’s unified pool.
The Spark’s defining quirk is exactly that unified memory architecture (UMA): CPU and GPU share a single pool. That is its strength (no host↔device copy) and, as we will see, its trap.
The failure mode: a leak via unified memory
On every attempt, the same scenario: the server started, answered correctly… then its memory usage climbed request after request until it saturated the shared pool, followed by a crash. Not a clean load-time error — a progressive leak.
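A progressive leak of this kind can be told apart from normal warm-up growth by fitting a trend to memory samples taken after each request. The sketch below is illustrative only, not the tooling we ran: the function names, the 1 MiB-per-request threshold and the warm-up length are all assumptions chosen for the example.

```python
def slope_per_request(samples_mib):
    """Least-squares slope of memory usage (MiB) against request index."""
    n = len(samples_mib)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mib) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mib))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples_mib, min_slope_mib=1.0, warmup=5):
    """Flag sustained growth once the first `warmup` samples are discarded.

    Both thresholds are hypothetical defaults; tune them to your own load.
    """
    steady = samples_mib[warmup:]
    if len(steady) < 3:
        return False  # not enough steady-state data to judge
    return slope_per_request(steady) >= min_slope_mib
```

Fed one sample per request, a healthy server should settle to a slope near zero, while the pattern described above would keep a clearly positive slope for as long as requests arrive.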
The cause sat at the intersection of three things:
- The CUDA allocator on unified memory: blocks were not actually returned to the pool.
- FlashInfer’s workspace (the attention buffers) growing without being recycled correctly in this UMA context.
- No guardrail: in our setup, cgroup limits did not bound the runaway on UMA, and the Linux kernel does not distinguish CPU RAM from VRAM in that single pool, so nothing capped the drift before the crash.
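Since nothing capped the drift for us, one possible stopgap is an external watchdog that watches the shared pool and restarts the server before it saturates. The following is a minimal sketch under assumptions: it treats `MemAvailable` from `/proc/meminfo` as a reasonable proxy for the free part of the unified pool, and the 8 GiB floor and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def read_meminfo(path="/proc/meminfo"):
    """Read the live meminfo text (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return handle.read()


def should_stop_server(meminfo_text, min_available_gib=8.0):
    """True when available memory drops below the (assumed) safety floor."""
    available_kib = parse_meminfo(meminfo_text)["MemAvailable"]
    return available_kib / (1024 ** 2) < min_available_gib
```

A loop calling `should_stop_server(read_meminfo())` every few seconds, and restarting the inference service when it returns true, turns a hard crash into a controlled restart. It does not fix the leak; it only bounds the damage.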
Five attempts, same wall
We varied everything we could: stack versions (vLLM / FlashInfer), KV-cache size, explicit memory limits, allocator options. All five configurations converged on the same leak. From there, the math is simple: continuing meant debugging an immature upstream path (FP8/FlashInfer on Blackwell UMA) instead of shipping. Decision: close FP8 on this platform.
The pivot that worked: NVFP4
We moved to NVFP4 quantization (4-bit, compressed-tensors format, Marlin kernels). Result: it fits the GB10 memory budget, stays stable over time (no leak), and delivers higher throughput. It was promoted to production on 2 June 2026 as the always-on format.
The counter-intuitive part: well-implemented 4-bit quantization beat FP8 on our hardware. Not because FP8 is “worse” in absolute terms, but because the 4-bit path was mature and stable on this platform, where the FP8 path was not. See also our LLM benchmark and our local LLM in production guide.
FP8 vs NVFP4 on GB10 (our stack)
| Criterion | FP8 | NVFP4 |
|---|---|---|
| Stability over time (UMA) | ❌ Progressive leak | ✅ Stable |
| Memory footprint | ➖ Heavier | ✅ Fits the pool |
| Throughput | ➖ Unmeasurable (crash) | ✅ Higher |
| Path maturity (our stack) | ❌ Immature on UMA | ✅ Mature |
| Production verdict | Abandoned after 5 tries | Promoted 2026-06-02 |
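The memory footprint row can be made concrete with back-of-the-envelope arithmetic on weights alone. The sketch below assumes the usual NVFP4 layout (4-bit values with one 8-bit scale per block of 16 values, so about 4.5 bits per weight) and ignores per-tensor scales, the KV cache and activations. The 30-billion-parameter count is a hypothetical figure for illustration, not the size of our production model.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_value, block_size=None, scale_bits=0):
    """Approximate weight storage: packed values plus optional per-block scales."""
    payload = n_params * bits_per_value / 8
    if block_size is None:
        return payload
    n_blocks = -(-n_params // block_size)  # integer ceiling division
    return payload + n_blocks * scale_bits / 8


def fp8_weight_bytes(n_params):
    # One byte per weight; per-tensor or per-channel scales are negligible.
    return weight_bytes(n_params, bits_per_value=8)


def nvfp4_weight_bytes(n_params):
    # Assumed layout: 4-bit values, one 8-bit scale per 16-value block.
    return weight_bytes(n_params, bits_per_value=4, block_size=16, scale_bits=8)


if __name__ == "__main__":
    n = 30_000_000_000  # hypothetical parameter count
    print(f"FP8:   {fp8_weight_bytes(n) / GIB:.1f} GiB")
    print(f"NVFP4: {nvfp4_weight_bytes(n) / GIB:.1f} GiB")
```

Under these assumptions NVFP4 weights take 4.5/8, or about 56%, of the FP8 weight footprint, which leaves correspondingly more of the shared pool for the KV cache and the rest of the system.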
Three lessons
- Unified memory changes the rules. A format that “works everywhere” else can leak on UMA. Test the format on your platform, not on the spec sheet.
- A negative result has value. Five documented attempts beat a sixth blind one. Knowing when to stop is an engineering skill.
- Validate on a frozen benchmark. Our NVFP4 verdict comes from a reproducible measurement, not an impression. That is what makes an infrastructure decision defensible.
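One simple way to keep a benchmark frozen is to fingerprint its inputs and store that fingerprint next to every result, so two runs are only compared when they used exactly the same prompts and settings. This is a sketch of the idea rather than our actual harness; the function name and the example settings keys are hypothetical.

```python
import hashlib
import json


def benchmark_fingerprint(prompts, settings):
    """SHA-256 over a canonical JSON encoding of the prompts and settings.

    Prompt order matters (it is part of the workload); settings key order
    does not, because keys are sorted before hashing.
    """
    payload = json.dumps(
        {"prompts": list(prompts), "settings": settings},
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

For example, calling `benchmark_fingerprint(prompts, {"max_tokens": 256, "temperature": 0.0})` before each run and refusing to compare results whose fingerprints differ guards against a silently edited prompt set.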
This experience is ours: our hardware, our versions, our load. It does not condemn FP8 in general — it documents precisely where and why it failed for us, and the format that replaced it.