Why SGLang Never Froze My Desktop But vLLM Did: an SM 12.1 MoE-Kernel Story
New to self-hosting AI? The Self-Hosted AI: Start Here hub walks the hardware-decision tree, the inference-engine choice, and the operational gotchas that bite hardest in the first months. This article is one of those gotchas.
New to this? Skip to “Plain-language version” near the bottom. The short story: one wrong setting made a GPU server crash the entire screen, not just itself, and it took days to see why.
The vLLM container serving Qwen3.6 on the DGX Spark ran for four days without a hiccup. Then I opened the file manager and the whole desktop froze. Hard. Mouse dead, no window redraw, nothing. SSH from my phone still worked. The CLI in my terminal still worked. Only the graphical desktop was gone.
I restarted gnome-shell. Desktop came back for about thirty seconds, then froze again the next time a GPU-touching app started (file manager, anything with thumbnails). Whack-a-mole.
The detail that cracked it: SGLang serving Mistral Small 4 had run for days before this with zero desktop freezes. Same hardware, same unified memory, same desktop. The freeze was specific to the vLLM stack, not generic GPU contention. That ruled out the easy explanation (“inference uses the GPU, desktop fights for it”) because SGLang used the GPU just as hard and never did this.
What the symptom looks like
When the bad kernel fires, the visual symptom is abrupt: screen redraws stop, the mouse cursor freezes in place, and every open window becomes a static screenshot. There is no error dialog, no progress spinner, no warning. GNOME simply stops updating the display. Audio keeps playing if something was already buffered. The clock in the top bar stops advancing.
If you run dmesg --follow in a terminal before it happens, you will see lines like:
[XXXXX.XXXXXX] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU-00000000:00:00.0
[XXXXX.XXXXXX] NVRM: Xid error: (0x00000000) BBB/ESR/ECR/TS
NVIDIA’s Xid reference lists Xid 79 as “GPU has fallen off the bus”, meaning the driver lost contact with the GPU; it is not a display-specific code. On this machine (kernel driver 580.x, as shipped with the DGX Spark platform firmware from early 2026) it showed up at the moment the display engine submitted work to the GPU and never got completion. On a machine with dedicated VRAM this is contained; on unified memory, the display engine and the compute path share the same queue arbitration, which is why it wedges globally.
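To catch these lines without staring at the terminal, a small filter can pull the Xid code out of kernel log output. This is a minimal sketch: the regular expression is modelled only on the sample line shown above, and `find_xid_events` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Matches the NVRM Xid header line; the format follows the sample line in this article.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9A-Fa-f:.]+)\): (?P<code>\d+)")


def find_xid_events(dmesg_lines):
    """Return (pci_address, xid_code) tuples for every Xid header line found."""
    events = []
    for line in dmesg_lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("pci"), int(match.group("code"))))
    return events


sample = [
    "[XXXXX.XXXXXX] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU-00000000:00:00.0",
    "[XXXXX.XXXXXX] NVRM: Xid error: (0x00000000) BBB/ESR/ECR/TS",
]
print(find_xid_events(sample))  # [('PCI:0000:01:00', 79)]
```

Only the header line carries the numeric code, so the follow-up "Xid error" line is deliberately ignored. The same function can be fed lines read from `dmesg --follow` over a pipe.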
nvidia-smi at the moment of freeze reports 0% utilization and shows the compute processes still listed as running. The GPU is not crashed in the traditional sense; it is wedged waiting for a fence that the faulted MoE kernel never released.
What the system actually said
System load during a freeze: 0.67. Not a CPU thrash. free showed 53 GB available. Not memory exhaustion. nvidia-smi reported 0% GPU utilization. The machine was, by every normal metric, idle and healthy. gnome-shell was alive but stuck in a poll wait, blocked on a GPU fence that never returned.
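That combination of readings can be written down as a rough triage rule. The sketch below is an illustration only: the `Snapshot` fields and the thresholds (`max_load`, `min_available_gb`) are assumptions chosen for the example, not values from any monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    load_average: float         # 1-minute system load
    available_gb: float         # memory reported as available by `free`
    gpu_utilization_pct: float  # utilization reported by nvidia-smi
    desktop_responsive: bool    # does the compositor still redraw?
    ssh_responsive: bool        # does a remote shell still answer?


def looks_like_wedged_gpu_path(s, max_load=2.0, min_available_gb=8.0):
    """Heuristic: healthy metrics, dead display and live SSH suggest a kernel-path fault."""
    resources_fine = s.load_average < max_load and s.available_gb > min_available_gb
    gpu_idle = s.gpu_utilization_pct == 0
    return resources_fine and gpu_idle and s.ssh_responsive and not s.desktop_responsive


# The readings from the freeze described above.
freeze = Snapshot(0.67, 53.0, 0.0, desktop_responsive=False, ssh_responsive=True)
print(looks_like_wedged_gpu_path(freeze))  # True
```

The point of the rule is the negative space: every resource check passes, yet the display is dead, so the resource checks are the wrong place to keep looking.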
A GPU fence is a synchronization point: the compositor asks the GPU “tell me when this draw is done” and waits. If the GPU never answers because a kernel launch faulted, the compositor waits forever. On a machine with a dedicated GPU and dedicated VRAM, a faulted compute kernel usually kills only the offending process. On the DGX Spark there is no separate VRAM. CPU and GPU share one 128 GB pool, one memory controller, one set of kernel-driver paths. A bad kernel launch there does not stay contained. It can wedge the path the display compositor also depends on.
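The fence behaviour can be mimicked in a few lines. This is an analogy, not driver code: a `threading.Event` stands in for the fence, and `wait_for_fence` is a hypothetical helper. A timeout is added so the example terminates; the wait described above had no such escape.

```python
import threading


def wait_for_fence(fence, timeout_s):
    """Return True if the fence signalled in time, False if the wait timed out."""
    return fence.wait(timeout=timeout_s)


# Healthy case: the "GPU" finishes the draw and signals the fence shortly after.
healthy_fence = threading.Event()
threading.Timer(0.01, healthy_fence.set).start()
print(wait_for_fence(healthy_fence, timeout_s=1.0))  # True

# Faulted case: the kernel launch failed, so nothing ever signals the fence.
stuck_fence = threading.Event()
print(wait_for_fence(stuck_fence, timeout_s=0.05))  # False
```

Without the timeout, the second call would block indefinitely, which is the state gnome-shell was stuck in.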
The root cause
From the NVIDIA developer forums and the build.nvidia.com vLLM-on-Spark troubleshooting page, one line:
Make sure you are using -e VLLM_FLASHINFER_MOE_BACKEND=latency, as the throughput backend has SM120 kernel issues on SM 12.1.
The Spark’s GB10 Blackwell GPU is compute capability SM 12.1. vLLM’s FlashInfer Mixture-of-Experts path has two backends: throughput (optimized for many concurrent users) and latency (optimized for single-stream). The throughput backend’s kernels were compiled and tested for SM 12.0 and earlier. On SM 12.1 they are broken. This issue was documented in the vLLM build.nvidia.com troubleshooting notes as of May 2026 and affects every vLLM container that uses a MoE model (Qwen, Mixtral, or any other sparse-expert architecture) on the GB10. Our launch script ran --performance-mode throughput with no VLLM_FLASHINFER_MOE_BACKEND override, so it took the broken path.
Every time the MoE layer hit that kernel under the right shape, the launch faulted and the GPU stopped answering fences. On unified memory that did not just kill the request; it took the desktop compositor down with it.
SGLang never triggered this because SGLang does not use vLLM’s FlashInfer-MoE-throughput code at all. Different inference engine, different kernel path, no SM 12.1 throughput-MoE bug. That is the entire reason four days of Mistral-on-SGLang never froze the desktop and the first heavy desktop interaction under vLLM-on-Qwen did.
The fix
docker run ... \
-e VLLM_FLASHINFER_MOE_BACKEND=latency \
... \
vllm serve <model> \
...
# removed: --performance-mode throughput
# removed: --optimization-level 3
Three changes:
- VLLM_FLASHINFER_MOE_BACKEND=latency forces the MoE layer onto the latency backend, whose kernels work on SM 12.1.
- Drop --performance-mode throughput. It selects the broken backend by default and, for a single-user agent workload, it trades latency for batch throughput we do not use anyway.
- Drop --optimization-level 3. It is the same class of batch-throughput tuning and is equally at odds with a single-stream workload.
Keep CUDA graphs on (do not add --enforce-eager). Graphs are not the problem and disabling them costs real speed.
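The checklist above is easy to turn into a pre-launch guard. The sketch below assumes the launch script exposes its environment as a dict and its `vllm serve` arguments as a list; `check_launch_config` is a hypothetical helper, and the flag names are copied from this article rather than verified against the vLLM CLI.

```python
def check_launch_config(env, serve_args):
    """Return warnings for the settings this article found risky on SM 12.1."""
    warnings = []
    if env.get("VLLM_FLASHINFER_MOE_BACKEND") != "latency":
        warnings.append("VLLM_FLASHINFER_MOE_BACKEND is not set to 'latency'")
    # Pair each argument with its successor to spot "--flag value" combinations.
    pairs = list(zip(serve_args, serve_args[1:]))
    if ("--performance-mode", "throughput") in pairs:
        warnings.append("--performance-mode throughput is present")
    if ("--optimization-level", "3") in pairs:
        warnings.append("--optimization-level 3 is present")
    if "--enforce-eager" in serve_args:
        warnings.append("--enforce-eager disables CUDA graphs, which were not the problem")
    return warnings


old_launch = check_launch_config(
    {}, ["--performance-mode", "throughput", "--optimization-level", "3"]
)
fixed_launch = check_launch_config({"VLLM_FLASHINFER_MOE_BACKEND": "latency"}, [])
print(len(old_launch), len(fixed_launch))  # 3 0
```

The check only recognises the space-separated `--flag value` form; a script that passes `--flag=value` would need an extra branch.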
After the change: container restarted, desktop opened the file manager, thumbnails rendered, no freeze. Verified across a fifteen-request sequential load with the desktop in active use. The compositor never stalled again. Tested on 2026-05-18 with the NVIDIA-provided vLLM container on the DGX Spark running kernel driver 580.x.
There is a measured cost. The latency backend runs the production config at about 70 tok/s instead of the roughly 73 the broken throughput path managed in the brief windows it did not crash. Three tokens per second is the price of a desktop that does not lock up. Easy trade.
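For scale, the same numbers as a relative cost, using only the two figures quoted above:

```python
latency_tok_s = 70.0     # latency backend, from the measurement above
throughput_tok_s = 73.0  # broken throughput backend, in the windows it survived

lost_tok_s = throughput_tok_s - latency_tok_s
relative_cost_pct = 100.0 * lost_tok_s / throughput_tok_s

print(f"{lost_tok_s:.0f} tok/s slower, about {relative_cost_pct:.1f}% of throughput")
# 3 tok/s slower, about 4.1% of throughput
```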
What happens if you skip the fix
If you do not set VLLM_FLASHINFER_MOE_BACKEND=latency and keep running, the freezes are not random: they are load-triggered. Light, single-token requests can run for hours without incident. The moment a request hits a MoE layer at a shape that exercises the broken kernel, the desktop locks. In practice this means a container that “ran fine all morning” fails the instant you open a browser or a file manager alongside an active inference request. The failure is intermittent enough to look like a coincidence the first two times.
Ignoring it also means every freeze requires a gnome-shell restart (pkill -HUP gnome-shell or logging out), because the GPU fence never resolves on its own. The inference container keeps running and keeps serving requests over the network, which is why SSH stays up. Only the display path is wedged. On a headless server this would be invisible. On a desktop workstation it is a full productivity interruption every time a heavy inference request coincides with a graphical repaint. Tested on vLLM built from the NVIDIA-provided container (as of May 2026, container tag matching the Spark platform firmware 1.x line).
Why this was hard to see
The metrics lie when the failure is a faulted GPU kernel on unified memory. Load is low because nothing is spinning. Memory is fine because nothing leaked. nvidia-smi shows 0% because the GPU is not computing, it is wedged. Every dashboard says “healthy” while the screen is dead. The only signal that pointed anywhere was the comparison: SGLang never did this, vLLM did, same box. When two setups differ in one variable and only one fails, that variable is the lead. Here the variable was the inference engine’s MoE kernel path, and the forums had the answer once I knew to ask “vLLM MoE SM 12.1” instead of “DGX Spark desktop freeze”.
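The comparison step can be made mechanical: describe both setups as key-value pairs and list what differs. The dictionaries below are a simplified illustration; the keys and labels are assumptions made for this example, and `differing_keys` is a hypothetical helper.

```python
def differing_keys(working, failing):
    """Return the keys whose values differ between two setup descriptions."""
    all_keys = sorted(set(working) | set(failing))
    return [k for k in all_keys if working.get(k) != failing.get(k)]


sglang_setup = {
    "hardware": "DGX Spark (GB10)",
    "memory": "unified",
    "desktop": "GNOME",
    "moe_kernel_path": "not vLLM's FlashInfer throughput backend",
}
vllm_setup = {
    "hardware": "DGX Spark (GB10)",
    "memory": "unified",
    "desktop": "GNOME",
    "moe_kernel_path": "vLLM FlashInfer MoE, throughput backend",
}
print(differing_keys(sglang_setup, vllm_setup))  # ['moe_kernel_path']
```

Writing the setups down this way also forces the question of which keys belong in the description at all, which is where "inference engine" stopped being a detail and became the lead.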
Plain-language version
Your computer’s graphics chip runs tiny programs called kernels to do its work. The AI model uses one kind of kernel, the desktop uses another. On this particular machine (a DGX Spark) the chip and the main memory are one shared pool, not separate. We were running the AI server in a mode whose kernels are buggy on this exact chip generation. When a buggy kernel ran, the graphics chip got stuck. Because everything shares one pool, “stuck” did not just crash the AI, it froze the whole screen. The terminal and remote login still worked because they do not need the graphics chip.
The other AI server we had used before (a different program called SGLang) never used the buggy kernel, so it never froze anything. That difference, four days fine on one, instant freeze on the other, same computer, was the clue.
The fix was one setting that tells the AI server “use the other, working kernels”. The screen has been stable since.
Takeaway
On shared-memory accelerators like the DGX Spark, a faulted compute kernel is not a contained failure. Treat “the whole machine froze but SSH works” as a kernel-path bug, not a resource problem, and look for the one variable that differs from a setup that worked. The dashboards will all say everything is fine. They are measuring the wrong thing.
Cross-references
- The companion measurement story, where another “obviously the model is bad” conclusion turned out to be a broken instrument: Mistral vs Qwen3.6 on DGX Spark: the 0/30 That Was a Broken Ruler
- The unified-memory sibling bug, where GPU memory not freeing instantly after a restart caused SIGKILL 137: SGLang Restart OOM Fix: Unified Memory Cleanup on GB10
- The SGLang speed baseline this article references (SGLang never froze the desktop in days of this): SGLang on DGX Spark: 35-41 tok/s with EAGLE
- Same lesson, different bug: a multi-hour “Blackwell GPU hang” that was really a config-init AttributeError, found by reading raw logs instead of trusting hardware theories: The 3.5-Hour Deadlock That Was Really an AttributeError