Gemma-4-31B NVFP4 on a Single DGX Spark: When the Quantization Is the Bottleneck
Gemma-4-31B is the reasoning half of my plan to complement Qwen on a single DGX Spark: Qwen drives general and vision work, Gemma handles math and step-by-step reasoning. NVIDIA ships it in NVFP4, a Blackwell-native 4-bit format, which at ~30GB fits into the Spark’s 128GB unified memory with room to spare. The model is good. The speed story is where the marketing and the silicon disagree, and that disagreement is the more useful article. I measured single-stream decode on one GB10, with the same harness as everything else.
Verdict at a glance
| What it is | dense 31B, text-only, NVFP4 (modelopt) quant, TRITON_ATTN forced by head-dim geometry |
| Single-Spark decode | 6.8 tok/s (median). The same-family 26B-A4B MoE does 36.9 tok/s plain and 53.7 with FP8 KV + MTP on the same box, so this dense build is the slow one. |
| The ceiling | a dense 31B is bandwidth-bound at ~4.4 tok/s in BF16 (31e9 params x 2 bytes / 273 GB/s); quant only claws back the weight read, so the measured 6.8 tok/s sits a little above that physics line, not multiples above it |
| As a reasoner | 7/7 on the hard probe (ties Qwen), but at 6.8 tok/s. The 26B MoE gets the same 7/7 at 5x the speed, so the dense build has no edge to justify the wait. |
| Memory | ~68GB used / ~53GB free at util 0.50; cold boot ~306s (torch.compile + graph capture) |
| Spark gotchas | default FP4 kernel path is broken on sm_121 (force Marlin); util 0.80 OOMs the desktop; ~5min cold compile; vLLM auto-forces TRITON_ATTN |
| Do I run it? | Not the dense build. It reasons well (7/7) but at 6.8 tok/s. Run the 26B-A4B MoE instead: same 7/7, 5x or more the speed. Or skip a dedicated Gemma reasoner entirely if your general model already aces your reasoning load (mine does). |
The ceiling is physics, then a broken kernel on top
Single-stream decode on the GB10 is bound by memory bandwidth: every token streams the active weights across a ~273 GB/s bus. A dense 31B activates all 31B parameters per token. At 2 bytes each that is 31e9 x 2 / 273e9 ≈ 4.4 tok/s as a hard BF16 ceiling. Quantization helps the weight read, not the attention compute, so NVFP4 lands a little above that, not multiples above it. This is the same active-parameter lesson the 120B Nemotron teardown made: on a Spark, decode speed tracks active parameters, and a dense model has nowhere to hide.
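The ceiling arithmetic above can be sketched in a few lines. This is a back-of-envelope model, not a benchmark: it assumes every decoded token streams all weights exactly once and ignores KV-cache reads and activations. The 30e9-byte NVFP4 figure is an assumption taken from the ~30GB footprint quoted earlier, not a measured read size.

```python
def decode_ceiling_tok_s(bytes_read_per_token: float, bandwidth_bytes_s: float) -> float:
    """Upper bound on single-stream decode when each token streams the weights once."""
    return bandwidth_bytes_s / bytes_read_per_token


BANDWIDTH = 273e9  # ~273 GB/s, the bus figure used in the text
PARAMS = 31e9      # dense 31B: every parameter is active on every token

# BF16: 2 bytes per parameter.
bf16_ceiling = decode_ceiling_tok_s(PARAMS * 2, BANDWIDTH)

# Assumption: the ~30GB NVFP4 checkpoint is what gets streamed per token.
nvfp4_ceiling = decode_ceiling_tok_s(30e9, BANDWIDTH)

print(f"BF16 ceiling:  {bf16_ceiling:.1f} tok/s")
print(f"NVFP4 ceiling: {nvfp4_ceiling:.1f} tok/s (assumed 30GB read per token)")
```

Under these assumptions the measured 6.8 tok/s lands between the two bounds, which is consistent with a bandwidth-limited decode rather than a compute-limited one.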
On top of the physics sits a kernel bug. On sm_121 the default NVFP4 GEMM path (CUTLASS / FlashInfer FP4) is missing the tensor-core instructions it expects, emits unsupported PTX, and silently falls back to a slow path. The fix the DGX-Spark community converged on (see also the Sggin1 NVFP4 guide) is to force the Marlin kernel, which dequantizes FP4 to BF16 in-kernel and actually runs. Measured here: the dense build decodes at 6.8 tok/s single-stream, a little above the 4.4 tok/s BF16 bandwidth ceiling, which is roughly what NVFP4 claws back. I did not chase the Marlin A/B on the dense build in the end: the community reports it buys roughly +16%, which does not change the verdict, because the ceiling is bandwidth, not the kernel. The MoE (next section) is the real fix, so the dense build is being retired rather than micro-optimized.
Bring-up: it booted, it was just slow to compile
Unlike GLM, Gemma did not need a flag fight. Three things to know:
- vLLM forces the attention backend for you. Gemma-4 has heterogeneous head dimensions (256 local, 512 global), and vLLM logs “Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.” Setting VLLM_ATTENTION_BACKEND yourself is pointless: it gets logged as an unknown variable and ignored. TRITON_ATTN is the only backend that handles these head dims, and it is not the bottleneck anyway.
- The cold start is long. torch.compile plus CUDA-graph capture took about five minutes on first boot. My initial diagnostic poll gave up at four minutes and reported a false timeout; the model was fine, still compiling. Persisting the compile cache across mutex swaps (a host-mounted VLLM_CACHE_ROOT) keeps that recompile out of subsequent boots, as long as the launch flags stay byte-identical (the cache key hashes them).
- The util tax is identical to GLM. 0.80 reserves ~97GB of unified memory and OOMs the desktop. At 0.50 the box sits at ~68GB used with ~53GB free. fp8 KV comes automatically from the NVFP4 checkpoint.
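The false timeout in the second point is easy to avoid with a readiness poll whose deadline comfortably exceeds the ~306s cold boot. The sketch below is a minimal illustration, not the diagnostic I actually ran: the helper names are hypothetical, and the URL assumes vLLM's default port 8000 and its /health endpoint on the host network.

```python
import http.client
import time
import urllib.request


def wait_until_ready(probe, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Call probe() until it returns True or timeout_s elapses; return the outcome."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False


def http_probe(url="http://localhost:8000/health"):
    """True once the server answers 200; URL and port are assumptions."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused or reset while the engine is still compiling.
        return False


# Example (not executed here): allow ten minutes, about twice the cold boot.
# ready = wait_until_ready(http_probe, timeout_s=600)
```

A deadline of roughly twice the observed cold boot leaves headroom for a slower first compile without hiding a genuinely hung container for long.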
The optimization that matters is not a flag
Here is the honest part, and I went and measured it rather than leaving it as theory. You can persist the compile cache and trim CUDA-graph sizes, and Gemma-4-31B is still a dense model decoding in the single digits. The real lever is the model, not the flag. So I brought up the same-family Gemma-4-26B-A4B MoE (FP8, RedHatAI build), which activates only ~3.8B of its 25B parameters per token, on the same box and the same probes.
It decoded at 36.9 tok/s in a minimal config (no speculative decoding, no FP8 KV) against the dense 31B’s 6.8. Adding the speed pack (FP8 KV plus an MTP drafter at γ=4) lifted it to 53.7 tok/s single-stream on my box: about 50-60 on natural prose and up to 85-88 on predictable output, since speculative decoding’s gain tracks how well the drafter guesses, which tracks the content. It scored 7 out of 7 on the hard reasoning set, identical to its larger dense sibling and to my Qwen (MTP speculative decoding is lossless, so the score is unchanged). I did not reproduce the 108 tok/s the community reports, which comes from more aggressive memory settings and a different measurement; my number is the conservative prefill-separated decode at util 0.50.
Either way the conclusion is not subtle: the MoE is 5 to 8x faster than the dense build for no measurable reasoning loss, and the dense 31B is dominated. I am retiring the dense build and keeping the 26B MoE as the reasoning slot. The honest caveat is that my reasoning probe tops out at 7/7 for Qwen too, so the case for running a Gemma reasoner at all alongside an already-7/7 Qwen rests on harder math than this set measures, or on offloading reasoning off the primary, not on out-scoring it here.
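Why speculative decoding’s gain depends on content can be made concrete with the standard idealized model: if each drafted token is accepted independently with probability α and the drafter proposes γ tokens, one verification step yields (1 - α^(γ+1)) / (1 - α) tokens on average. The sketch below assumes that i.i.d. model; the acceptance rates are hypothetical, not values I measured, and tokens per step is not the same as wall-clock speedup because drafting has its own cost.

```python
def expected_tokens_per_step(accept_rate: float, gamma: int) -> float:
    """Expected tokens emitted per verification step under i.i.d. acceptance."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        # Every draft is accepted, plus the one token the target model adds.
        return float(gamma + 1)
    return (1.0 - accept_rate ** (gamma + 1)) / (1.0 - accept_rate)


GAMMA = 4  # draft length used in the text

# Hypothetical acceptance rates: lower for free prose, higher for predictable output.
for alpha in (0.5, 0.8, 0.9):
    tokens = expected_tokens_per_step(alpha, GAMMA)
    print(f"alpha={alpha:.1f}: {tokens:.2f} tokens per step")
```

The curve is steep near α = 1, which is one plausible reading of why predictable output runs well ahead of natural prose with the same drafter.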
There is also the eager-vs-graphs question. On a discrete GPU --enforce-eager is a throughput mistake. On the Spark it was measured at only ~3% decode loss while saving ~13GB and removing the compile/capture entirely, because the workload is bandwidth-bound and CUDA-graph capture can grow unified memory unpredictably. For a memory-constrained, OOM-prone box that is a defensible trade. I did not chase the eager-vs-graphs A/B in the end, because the MoE made the dense build moot before it was worth the tuning time.
The upgrade trap: a newer vLLM breaks it outright
A warning for anyone tempted to chase a newer vLLM for more NVFP4 speed: do not, at least not on this model yet. I pulled the latest nightly (v0.23.1rc1.dev309) to fix an unrelated coding model, and it broke Gemma at startup with a modelopt quant tie_weights NotImplementedError. The NVFP4 quantization path’s weight-tying is unimplemented in that build, and Gemma ties its embedding and output-projection weights, so the engine dies during initialization. It runs on the older v0.20.2rc1 and not on 0.23. On bleeding-edge consumer Blackwell, the vLLM version is a load-bearing dependency you pin and test per model, not one you float. (I rolled back; the regression is tracked upstream as vLLM #45543, traced to PR #39612, which changed ParallelLMHead.tie_weights to delegate to a quant method that ModelOpt does not implement.)
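One way to make “pin and test per model” mechanical is a small guard that refuses to launch a model on a vLLM build it has not been validated against. This is a sketch under assumptions: the pin table and function names are hypothetical, and the version string reported inside a given container may carry extra suffixes, which is why the comparison is a prefix match.

```python
from importlib import metadata

# Hypothetical pin table: served model name -> vLLM version it was validated on.
PINNED_VLLM = {"gemma4-31b": "0.20.2rc1"}


def installed_vllm_version() -> str:
    """Version of the vLLM package in the current environment."""
    return metadata.version("vllm")


def pin_ok(model: str, installed: str, pins: dict = PINNED_VLLM) -> bool:
    """True if the installed version matches the pin recorded for this model."""
    expected = pins.get(model)
    if expected is None:
        raise KeyError(f"no pinned vLLM version recorded for {model!r}")
    # Prefix match tolerates a leading 'v' and trailing dev/local suffixes.
    return installed.lstrip("v").startswith(expected)


# Example (not executed here; requires vLLM to be installed):
# if not pin_ok("gemma4-31b", installed_vllm_version()):
#     raise SystemExit("vLLM version differs from the pin; re-test before serving")
```

The check is deliberately strict: a mismatch is a prompt to re-run the smoke test, not proof that the newer build is broken.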
Where it fits: a specialist you delegate to, not one you watch
6.8 tok/s is the number that decides the architecture, not just the model. At that speed Gemma is unusable as an interactive assistant: you would watch the cursor crawl. But that only rules out one mode of use. For a delegated, asynchronous reasoning job, where you hand off a problem and come back to the answer, decode speed matters far less than whether the answer is right. The open question this measurement sets up is therefore not “is it fast” (it is not) but “is its reasoning good enough to be worth the wait over Qwen”, which is a separate capability test.
The unified-memory mutex shapes the rest. You cannot keep a fast general model hot and call this one concurrently on a single Spark, because two models do not fit. What you can do is swap: run Qwen as the default, and when a task genuinely needs the reasoner, switch to Gemma, run it, switch back. The persisted compile cache is what makes that swap cheap enough to be a workflow rather than a coffee break. Whether that is worth building depends entirely on the quality delta, and on whether a faster same-family MoE would let you keep the quality without the swap at all.
Reproduce
The working launcher (switch-llm.sh gemma), abbreviated:
docker run -d --name vllm-gemma4-31b --gpus all --network host --ipc host \
-v /ai/models:/ai/models -v /ai/vllm-cache:/ai/vllm-cache \
-e TORCH_CUDA_ARCH_LIST=12.1a -e VLLM_FLASHINFER_MOE_BACKEND=latency \
-e VLLM_CACHE_ROOT=/ai/vllm-cache/gemma \
ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest \
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --served-model-name gemma4-31b \
--quantization modelopt --language-model-only \
--max-model-len 65536 --max-num-seqs 4 --gpu-memory-utilization 0.50 \
--enable-prefix-caching --enable-chunked-prefill --trust-remote-code \
--enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4
No VLLM_ATTENTION_BACKEND (vLLM forces TRITON_ATTN). Since I am retiring this dense build for the 26B MoE, I did not finalize the Marlin-forcing env (VLLM_USE_FLASHINFER_MOE_FP4=0 / VLLM_NVFP4_GEMM_BACKEND=marlin); those are the lever to evaluate first if you keep a dense NVFP4 model on sm_121.
Caveats
Single-stream, one GB10, vLLM v0.20.2rc1 nightly, June 2026. The decode numbers are single-user. NVFP4 on consumer Blackwell is explicitly experimental; the format and kernel support are moving, and a healthy boot does not guarantee correct output, so I smoke-test real generations. I run a Gemma reasoner myself in the mutex rotation; whether it stays dense or becomes the 26B MoE was the open question this measurement was meant to settle, and the numbers above settle it in favor of the MoE.