Reading path
I'm fighting SGLang
Start with the setup that works on GB10, then work through the specific failures in the order you are most likely to hit them.
5 articles, in reading order
- Self-Host Mistral Small 4 with SGLang on NVIDIA DGX Spark (GB10): What Actually Works
The baseline that works: backend choice, the launch flags, and the kernel constraints on SM 12.1.
- SGLang Restart OOM Fix: Unified Memory Cleanup on GB10/DGX Spark
OOM on restart: the page-cache hijack, and the drop_caches discipline to follow before every relaunch.
- SGLang on DGX Spark: 35-41 tok/s with EAGLE Speculative Decoding
Measured throughput, so you know what good looks like and when something is actually wrong.
- EAGLE Throughput Is Content-Dependent: Same Run, 14 to 31 Tokens Per Second
Why your tok/s swings with the prompt: EAGLE's speedup depends on how many drafted tokens get accepted, and that depends on the content.
- Why SGLang Never Froze My Desktop But vLLM Did: an SM 12.1 MoE-Kernel Story
When the engine itself freezes the desktop: the SM 12.1 MoE-kernel story, and why SGLang never did.