GLM-4.7-Flash on a Single DGX Spark: the Repo Says AWQ, the Model Says MLA
I added GLM-4.7-Flash to my single DGX Spark for one reason: it is supposed to out-code my daily Qwen3.6, and at 30B total with 3B active it is small enough to leave the box headroom. The bring-up is where it got interesting. Two of the most-copied flags from the model card and the community recipes do not survive contact with Blackwell sm_121, and the failures are loud, fast, and instructive. I measured it the way I measure everything here: single-stream, one GB10, the same harness.
Verdict at a glance
| What it is | 30B total / 3B active MoE, compressed-tensors W4A16, MLA attention, MIT-style open weights |
| Single-Spark decode | 53.7 tok/s single-stream (range 49.8-54.9), once patched. Faster than the ~40 I expected; MTP speculative decoding earns its keep. (Qwen on the same box: ~69 tok/s.) |
| As a coding agent | works, produces correct terse code ("string"[::-1]). A full aider-polyglot pass_rate is impractical single-stream (a reasoning model thinking through 30 exercises at 53 tok/s runs for hours), so decode speed plus spot-correctness is the practical signal here. |
| Memory | ~81GB used / ~39GB free at util 0.50; healthy boot ~115s |
| Spark gotchas | ”AWQ” repo is actually compressed-tensors (do not pass —quantization); MLA forbids flash_attn; util 0.80 OOMs the desktop, use 0.50; and it boots healthy but dies on the first token until you patch MLA |
| Do I run it? | Yes, but only with a source patch. The first-token crash is NOT fixed by upgrading: it persists on the latest vLLM nightly (0.23.1) too, because PR #34695 guards the request-path reads but misses the ones in _compute_prefill_context. The same gap is tracked upstream as vLLM #43888, with a fix in PR #43889. Guard the prefill-context reads and GLM runs at 53.7 tok/s. |
Update (2026-06-26): I benched GLM head-to-head against Qwen and dropped it from the coding seat — Qwen-only now. A controlled agent-bench A/B (baseline arm, deterministic typecheck gate) on the
ts-renamecoding task measured GLM at ~195s/run vs Qwen at ~18s/run — about 10× slower wallclock in the agentic loop. Both produced correct, type-checking renames; the gap is wallclock, not correctness. GLM is a reasoning model, so it thinks through every tool-call round — fine for a one-shot answer, brutal for an interactive agent that loops dozens of times. I stopped the sweep early (thermals) before GLM reached the harder ambiguous-rename gate, so this is a speed verdict, not a final correctness ranking — but ~10× is decisive for an interactive default. Qwen keeps the coding seat; GLM is out of the daily rotation (the launcher and this write-up stay for reproducibility). The patched image still works and everything below reproduces — it just isn’t worth running over Qwen for day-to-day coding on one GB10.
The number nobody reports: single-stream on one GB10
Every GLM-4.7-Flash throughput figure online is a server number: many requests, batched, on datacenter GPUs. That is the wrong metric for an agent sitting on a desk, which experiences one request at a time with nothing batched behind it. On one DGX Spark, serving a single stream, vLLM decodes GLM-4.7-Flash at 53.7 tok/s (range 49.8-54.9), once it is patched to run at all. That is faster than the ~40 the forums led me to expect, and it puts GLM comfortably between my Qwen (69 tok/s) and the dense Gemma reasoner (under 7).
The reason the number is as high as it is comes down to active parameters plus a free lunch. GLM activates ~3B parameters per token against the GB10’s ~273 GB/s memory-bandwidth ceiling, the same ballpark as my Qwen, so the floor is similar. On top of that, MTP speculative decoding earns real throughput here: the model ships a single multi-token-prediction head, and at single-user batch sizes that is exactly where speculative decoding pays off (vLLM’s GLM recipe prescribes num_speculative_tokens 1 for the same reason). The 30B total is mostly idle weight that costs disk and RAM, not decode time.
Bring-up: two failures the recipes cause
This is the part worth the price of admission, because the public recipes actively mislead.
Failure 1: the “AWQ” build is compressed-tensors. The repo is named GLM-4.7-Flash-AWQ-4bit, so the obvious flag is --quantization awq_marlin. That crashes at config validation: “Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the quantization argument (awq_marlin).” The weights are packaged as compressed-tensors W4A16, not classic AWQ. The fix is to pass no quantization flag at all and let vLLM auto-detect: it loads CompressedTensorsWNA16MarlinMoEMethod and picks the Marlin MoE kernels itself. The name lies; the config tells the truth.
Failure 2: the model speaks MLA, so flash_attn is illegal. The next instinct is --attention-backend flash_attn, which every fast-LLM guide recommends. It crashes: “Selected backend FLASH_ATTN is not valid for this configuration. Reason: [‘head_size not supported’, ‘kv_cache_dtype not supported’, ‘MLA not supported’].” GLM-4.x uses Multi-head Latent Attention, the same compressed-KV trick as DeepSeek, and flash_attn cannot do MLA. The fix is again to specify nothing: vLLM auto-selects TritonMLABackend, which on sm_121 is the only MLA backend with working kernels. FlashMLA (Hopper/SM100), FlashInfer-MLA and CUTLASS-MLA (CC 10.x) are all gated out on Blackwell consumer silicon (vLLM attention-backend docs). There is no faster MLA path to switch to today.
The unified-memory tax. Both failures are recoverable in a minute. The one that bites harder is gpu-memory-utilization. On a discrete GPU you set it to 0.90 and move on. On the Spark, that fraction is a fraction of the 128GB unified memory the OS also lives in. At 0.80 vLLM reserved ~97GB and left the desktop with ~5GB, which is an out-of-memory wall, not a slowdown. At 0.50 (a 60GB budget) GLM is comfortable and the system keeps ~60GB. MLA helps here too: its KV cache is roughly a tenth of standard attention, so 0.50 is generous, not tight.
Optimization: what moved the needle, what did not
I tuned for the mutex workflow (only one model resident at a time, swapped on demand) and for single-user latency.
- Persist the torch.compile cache across swaps. The container’s
/root/.cacheis wiped on everydocker rm, so vLLM recompiled its Inductor graphs on every model switch. Mounting a host cache (VLLM_CACHE_ROOTon a bind mount) skips the recompile on warm boots, as long as the launch flags stay byte-identical (the cache key hashes them). CUDA-graph capture still runs each boot, so warm is faster but not instant. - Trim CUDA-graph capture sizes to [1,2,4]. With
max-num-seqs 4the engine never runs a batch of 8, so capturing that graph wastes startup and memory. Trimming costs nothing at single-user concurrency. - Keep MTP at num_speculative_tokens=1. GLM ships one MTP head; raising it lowers the acceptance rate and ends up slower. The recipe and model card agree, and so did the box.
- Keep VLLM_FLASHINFER_MOE_BACKEND=latency. Counterintuitively, on sm_121 this does not select a faster kernel (the latency/TRTLLM path is SM100-only, so the W4A16 MoE runs on Marlin regardless). It matters because the throughput FlashInfer MoE path has broken SM120 kernels that freeze the entire desktop (see also vLLM #43906). The flag keeps you off the box-killer path.
What did not help and is worth not trying: forcing a different attention backend (none exist for MLA on sm_121), --enforce-eager (kills CUDA graphs for a real speed loss), and any non-Marlin quant kernel.
The wall: a healthy endpoint that cannot generate a token
This is where the bring-up ended, and it is the most important part. After the two fixes above, vLLM loads the model, reports Application startup complete, and answers /health with a 200. By every check a dashboard would run, GLM is up. Then you send it a single chat completion and the engine dies:
mla_attention.py, forward_mha -> _compute_prefill_context:
kv_c_normed = kv_c_normed.to(self.kv_b_proj.weight.dtype)
AttributeError: 'ColumnParallelLinear' object has no attribute 'weight'
-> EngineDeadError
The cause is a clean incompatibility, not a tuning problem. MLA’s prefill step absorbs the kv_b_proj projection and reads its .weight tensor directly. In the cyankiwi build, kv_b_proj is packaged as compressed-tensors W4A16, so there is no plain .weight (the data lives in weight_packed and friends). vLLM’s MLA path in this nightly does not dequantize it, it just reaches for an attribute that is not there. I confirmed it is not a flag: the crash is identical with prefix-caching and chunked-prefill both off, with and without ignore_eos. It is the same stack every time.
A green /health told me nothing. The lesson generalizes: on Blackwell, with a quantized MLA model, a successful boot is necessary and nowhere near sufficient. Smoke-test one real generation before you believe a model works, let alone benchmark it.
”Just upgrade vLLM” does not fix it
The obvious move is to blame an old vLLM and pull a fresh nightly. I did: the image jumped from v0.20.2rc1 to v0.23.1rc1.dev309 (which is supposed to contain PR #34695, the fix for exactly this kv_b_proj.weight crash). It crashed in the same place. PR #34695 is incomplete: it guards the request-path reads, but _compute_prefill_context still reads kv_b_proj.weight.dtype unguarded, so on the latest nightly the model still dies, now even earlier (during the KV-cache profiling run at init rather than on the first request). The same gap is tracked upstream as vLLM #43888, with a fix in PR #43889. Upgrading was also a net loss for the rest of my stack: 0.23 broke my Gemma reasoner with a separate modelopt tie_weights NotImplementedError. I rolled back to 0.20.
The fix, and the numbers
The seam is MLA-meets-quantized-kv_b_proj, and the fix is small: guard the three self.kv_b_proj.weight.dtype reads (two of them in _compute_prefill_context) with hasattr(self.kv_b_proj, "weight") and fall back to the layer’s params_dtype when the weight is packed. That is the same shape as PR #34695, just applied to the reads it missed. I baked it into a patched image (a one-line sed on mla_attention.py that rewrites all three reads), pointed the launcher at it, and GLM came up clean: healthy in 115s, survives generation, decodes at 53.7 tok/s, and produces correct output (asked for a one-line string reversal it returns "string"[::-1] and nothing else).
Credit where due: this is the same guard the eugr/spark-vllm-docker DGX-Spark recipe ships as its fix-glm-4.7-flash-AWQ mod (a local copy of #34695, applied at build time), which is what pointed me at the real cause. That mod also bundles a separate triton_mla num_kv_splits speed patch I have not applied yet, it reportedly lifts short-context throughput further, so my 53.7 tok/s is likely a floor, not a ceiling. The crash fix is already tracked upstream as #43888 with a fix in PR #43889; my guard is the same shape, and I confirmed the same crash and fix on sm_121. Until that lands in a release, the patched image is what makes GLM runnable next to Qwen at all (though I later dropped it from the coding seat — see the Update at the top).
Reproduce
The working launcher (switch-llm.sh glm), abbreviated:
docker run -d --name vllm-glm47-flash --gpus all --network host --ipc host \
-v /ai/models:/ai/models -v /ai/vllm-cache:/ai/vllm-cache \
-e TORCH_CUDA_ARCH_LIST=12.1a -e VLLM_FLASHINFER_MOE_BACKEND=latency \
-e VLLM_CACHE_ROOT=/ai/vllm-cache/glm \
ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest \
vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit --served-model-name glm-4.7-flash \
--max-model-len 65536 --max-num-seqs 4 --gpu-memory-utilization 0.50 \
--compilation-config '{"cudagraph_capture_sizes":[1,2,4]}' \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--enable-prefix-caching --enable-chunked-prefill --trust-remote-code \
--enable-auto-tool-choice --tool-call-parser glm47 --reasoning-parser glm45
No --quantization (auto compressed-tensors), no --attention-backend (auto TritonMLA), no fp8 KV (unsupported with the MLA backend).
Caveats
Single-stream, one GB10, vLLM v0.20.2rc1 nightly, June 2026. The decode number is single-user and will differ under concurrency. GLM’s reasoning parser can leak <think> into content on some multi-turn tool-call paths; it is version-sensitive. I ran this as the coding engine in the mutex rotation for a while, but the head-to-head in the Update at the top retired it — Qwen now holds the coding seat; the GLM and Gemma launchers stay for reproducibility, not daily use.