Reading path

I'm fighting SGLang

The setup that works on GB10, then the specific failures in the order you are most likely to hit them.

5 articles, in reading order

  1. Self-Host Mistral Small 4 with SGLang on NVIDIA DGX Spark (GB10): What Actually Works

    The baseline that works: backend choice, the flags, and the kernel constraints on SM 12.1.

  2. SGLang Restart OOM Fix: Unified Memory Cleanup on GB10/DGX Spark

    OOM on restart: the page-cache hijack of unified memory, and the drop_caches discipline before relaunch.

  3. SGLang on DGX Spark: 35-41 tok/s with EAGLE Speculative Decoding

    Measured throughput, so you know what good looks like and when something is actually wrong.

  4. EAGLE Throughput Is Content-Dependent: Same Run, 14 to 31 Tokens Per Second

    Why your tok/s swings with the prompt: EAGLE speculative decoding is content-dependent.

  5. Why SGLang Never Froze My Desktop But vLLM Did: an SM 12.1 MoE-Kernel Story

    When the engine itself freezes the desktop: the SM 12.1 MoE-kernel story, and why vLLM froze where SGLang did not.
