Doesn’t FP8 work on Blackwell GPUs?

FP8 works very well on dedicated-memory Blackwell (data-center) GPUs. Our problem is specific to the DGX Spark’s GB10, which uses unified memory (UMA) shared between CPU and GPU. It was the combination of FP8 + the CUDA allocator + the FlashInfer workspace on unified memory that proved unstable for us, not FP8 itself.

What exactly was crashing?

A progressive memory leak. On unified memory, the CUDA allocator and FlashInfer’s attention workspace did not release memory properly, so usage climbed request after request until the shared pool was exhausted and the server crashed. cgroup does not protect against a runaway on UMA, and the Linux kernel does not distinguish CPU RAM from VRAM in that single pool.

How many attempts before giving up?

Five distinct configurations (vLLM/FlashInfer versions, KV-cache sizes, memory limits, allocator options). All ended in the same leak. At that point, persevering meant debugging an immature upstream path rather than shipping value, so we closed the FP8 effort on this platform.

What did you replace it with?

NVFP4 quantization (4-bit, compressed-tensors format, Marlin kernels), promoted to production on 2 June 2026. It fits the GB10 memory budget, stays stable over time, and delivers higher throughput. Well-implemented 4-bit beat FP8 on our hardware, which is the opposite of the usual intuition.

So should you avoid buying a DGX Spark?

No, that is not our conclusion. The GB10 is an excellent always-on inference node provided you pick the right format. Our conclusion is not ‘the GB10 is bad’ but ‘on unified memory, target NVFP4 over FP8, and validate your format on a frozen benchmark before production’.

FP8 on GB10: Why We Abandoned It After 5 Tries | Talki Academy

Here is a write-up you rarely see: a negative result. We wanted to run our production model in FP8 on the DGX Spark’s GB10 (Grace-Blackwell, unified memory). After five attempts, we abandoned FP8 on this platform and switched to NVFP4. Here is why, and what it taught us.

What we wanted, and why

FP8 is the “obvious” format on Blackwell: native hardware support, a good quality/throughput trade-off, wide data-center adoption. On an always-on node, the goal was simple: better throughput than GGUF, near-intact quality, and a memory footprint compatible with the GB10’s unified pool.

The Spark’s quirk is precisely that unified memory (UMA): CPU and GPU share a single pool. That is its strength (no host↔device copy) and, as we will see, its trap.

The failure mode: a leak via unified memory

Every attempt followed the same scenario: the server started and answered correctly, then its memory usage climbed request after request until the shared pool was saturated and the server crashed. This was not a clean load-time error but a progressive leak.

The cause sat at the intersection of three things:

The CUDA allocator on unified memory: blocks were not actually returned to the pool.
FlashInfer’s workspace (the attention buffers) growing without being recycled correctly in this UMA context.
No guardrail: cgroup does not bound a runaway on UMA, and the Linux kernel does not distinguish CPU RAM from VRAM in that single pool, so nothing capped the drift before the crash.

Observed symptom (schematic):
  t0   : model loaded, ~23 GB used, answers OK
  t+Nq : ~+X GB every N requests, never returned
  ...  : saturation of the unified CPU+GPU pool
  -> OOM / worker crash, no recovery

cgroup: does not bound a runaway on unified memory
Linux kernel: CPU RAM and VRAM = same pool, indistinguishable
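
The symptom above can be turned into a simple check. What follows is a minimal sketch, not the tooling we used. It assumes that on a unified-memory machine the MemAvailable field of /proc/meminfo reflects the shared CPU+GPU pool, and every function name and the threshold value are hypothetical. The idea is to sample used memory between requests and look at the slope: a healthy server plateaus, a leaking one keeps a positive slope.

```python
"""Sketch: flag a progressive leak from memory samples taken between requests."""


def parse_mem_available_kb(meminfo_text: str) -> int:
    """Return MemAvailable (in kB) from the text content of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line looks like: "MemAvailable:   12345678 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable field not found")


def growth_per_request(samples: list[tuple[int, float]]) -> float:
    """Least-squares slope of used memory (GB) against requests served.

    Each sample is (requests_served_so_far, used_memory_gb).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must span different request counts")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return cov / var_x


def looks_like_leak(
    samples: list[tuple[int, float]],
    threshold_gb_per_request: float = 0.001,  # hypothetical threshold, tune per workload
) -> bool:
    """True when memory keeps growing faster than the threshold."""
    return growth_per_request(samples) > threshold_gb_per_request
```

In practice one would read /proc/meminfo after every batch of requests, convert MemTotal minus MemAvailable to GB, append the sample, and alert or restart the worker as soon as looks_like_leak returns True, well before the pool saturates.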

Five attempts, same wall

We varied everything we could: stack versions (vLLM / FlashInfer), KV-cache size, explicit memory limits, allocator options. All five configurations converged on the same leak. From there, the math is simple: continuing meant debugging an immature upstream path (FP8/FlashInfer on Blackwell UMA) instead of shipping. Decision: close FP8 on this platform.

The pivot that worked: NVFP4

We moved to NVFP4 quantization (4-bit, compressed-tensors format, Marlin kernels). Result: it fits the GB10 memory budget, stays stable over time (no leak), and delivers higher throughput. It was promoted to production on 2 June 2026 as the always-on format.
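
To make the memory-budget argument concrete, here is a rough weight-size calculation. It is a sketch under stated assumptions: it counts weights only (no KV cache, activations, or unquantized layers such as embeddings), it uses the NVFP4 layout of 4-bit values with one 8-bit scale per block of 16 values, and the parameter count in the usage note is hypothetical, not the size of our model.

```python
"""Sketch: approximate weight footprint for FP8 versus NVFP4."""

FP8_BITS_PER_WEIGHT = 8.0
# NVFP4: 4-bit values plus one 8-bit scale shared by each block of 16 values.
# The additional per-tensor scale is negligible and ignored here.
NVFP4_BITS_PER_WEIGHT = 4.0 + 8.0 / 16.0


def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes) for a given parameter count."""
    return num_params * bits_per_weight / 8.0 / 1e9


def nvfp4_to_fp8_ratio() -> float:
    """Fraction of the FP8 weight size that NVFP4 needs."""
    return NVFP4_BITS_PER_WEIGHT / FP8_BITS_PER_WEIGHT
```

For a hypothetical model, weight_footprint_gb(num_params, FP8_BITS_PER_WEIGHT) and weight_footprint_gb(num_params, NVFP4_BITS_PER_WEIGHT) give the two weight sizes. Under these assumptions NVFP4 weights take a bit more than half the space of FP8 weights, and on a unified pool that difference is also memory handed back to the CPU side.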

The counter-intuitive part: well-implemented 4-bit quantization beat FP8 on our hardware. Not because FP8 is “worse” in the absolute, but because the 4-bit path was mature and stable on this platform, where the FP8 path was not. See also our LLM benchmark and our local LLM in production guide.

FP8 vs NVFP4 on GB10 (our stack)

Criterion	FP8	NVFP4
Stability over time (UMA)	❌ Progressive leak	✅ Stable
Memory footprint	➖ Heavier	✅ Fits the pool
Throughput	➖ Unmeasurable (crash)	✅ Higher
Path maturity (our stack)	❌ Immature on UMA	✅ Mature
Production verdict	Abandoned after 5 tries	Promoted 2026-06-02

Three lessons

Unified memory changes the rules. A format that “works everywhere” else can leak on UMA. Test the format on your platform, not on the spec sheet.
A negative result has value. Five documented attempts beat a sixth blind one. Knowing when to stop is an engineering skill.
Validate on a frozen benchmark. Our NVFP4 verdict comes from a reproducible measurement, not an impression. That is what makes an infrastructure decision defensible.
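
What "frozen" means can be sketched in a few lines. This is an illustration, not our benchmark harness: the generate callable, the parameter dictionary, and the function names are all hypothetical. The point is that the prompt set and the generation parameters are hashed, so two throughput numbers are only compared when they carry the same fingerprint.

```python
"""Sketch: a frozen benchmark ties each measurement to a fixed input set."""
import hashlib
import json
import time
from typing import Callable


def fingerprint(prompts: list[str], params: dict) -> str:
    """Stable hash of the prompt set and generation parameters."""
    payload = json.dumps(
        {"prompts": prompts, "params": params},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_benchmark(
    generate: Callable[[str], int],  # hypothetical: returns output tokens produced
    prompts: list[str],
    params: dict,
) -> dict:
    """Run every prompt once and report throughput with the fingerprint."""
    start = time.perf_counter()
    total_tokens = sum(generate(prompt) for prompt in prompts)
    elapsed = time.perf_counter() - start
    return {
        "fingerprint": fingerprint(prompts, params),
        "total_tokens": total_tokens,
        "elapsed_seconds": elapsed,
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }
```

A comparison between two formats is then only accepted when both result records share the same fingerprint. If someone edits a prompt or changes a sampling parameter, the fingerprint changes and the old numbers are no longer comparable.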

This experience is ours: our hardware, our versions, our load. It does not condemn FP8 in general; it documents precisely where and why it failed for us, and the format that replaced it.
