Self-Host Mistral Small 4 with SGLang on NVIDIA DGX Spark (GB10): What Actually Works
The SGLang stable image crashes on GB10 before serving a single token. Three days of debugging later, I found the exact nightly build, the exact flags, and the one missing file that the NVFP4 release quietly omits. A month of running it in production added five more lessons.
Quick Take
- Mistral Small 4 119B NVFP4 runs on NVIDIA DGX Spark (GB10) with SGLang nightly + CUDA 13
- Use `--attention-backend triton`, not `flashinfer`. Flashinfer crashes immediately on SM 12.1
- Expect 35–41 tok/s with EAGLE speculative decoding, ~94 GB RAM during inference, 30–120s RAM hold after `docker kill`
- The NVFP4 release omits `config.json` and `tokenizer.json` — copy them from the base repo first
- SGLang, Voxtral, and ComfyUI cannot share GPU memory: one at a time, always
Hardware
The DGX Spark GB10 Blackwell SoC pairs an ARM v9.2-A CPU with a GPU sharing the same 128 GB LPDDR5x unified memory pool. No dedicated VRAM. Everything runs ARM64. Every script, every binary, every Docker image needs an arm64 manifest.
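A quick way to confirm the host and the Docker daemon really are arm64 (a minimal check; the template field is standard docker version output):
uname -m                                     # aarch64 on the DGX Spark
docker version --format '{{.Server.Arch}}'   # arm64; amd64-only images will not run here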
Confirm the GPU is visible before anything else:
nvidia-smi -L
# GPU 0: NVIDIA GB10 Grace Blackwell (SM 12.1) @ 128 GB LPDDR5x Unified Memory
Note: nvidia-smi --query-gpu=memory.used returns [N/A] on GB10 because there is no separate VRAM to query. Use /proc/meminfo for actual memory state.
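For a quick read of actual memory state, MemAvailable from /proc/meminfo (the "available" column of free) is the number that matters on this machine:
grep -E 'MemTotal|MemAvailable' /proc/meminfo
free -h | awk '/^Mem:/{print "available:", $7}'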
Why SGLang Nightly, Not Stable
The stable SGLang image does not recognize GB10’s SM 12.1 architecture. It fails on launch:
CUDA error: invalid device ordinal
Only the nightly build with CUDA 13 support works. Pin a specific nightly tag rather than pulling latest — nightly images can break without notice, and the gap between “tag works” and “tag silently regresses” is sometimes a single push. Available tags are on Docker Hub:
lmsysorg/sglang:nightly-dev-cu13-20260323-999bad5a
This is the tag I run in production. When upgrading, test against the existing --cuda-graph-max-bs and --mem-fraction-static values first; flag semantics drift between nightly builds without changelog entries.
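To pin it, pull the tag explicitly and confirm the image architecture before relying on it (a minimal check using the tag above):
docker pull lmsysorg/sglang:nightly-dev-cu13-20260323-999bad5a
docker image inspect --format '{{.Os}}/{{.Architecture}}' \
  lmsysorg/sglang:nightly-dev-cu13-20260323-999bad5a    # expect linux/arm64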
Before You Download: ARM64 Workarounds
The Hugging Face IPv6 endpoint is unreachable on the DGX Spark’s network stack. Add these before any download:
export HF_HUB_DISABLE_XET=1 # Xet protocol defaults to IPv6 - disable it
# Always use -4 with wget to force IPv4:
wget -4 <url>
The CLI is named hf, not huggingface-cli:
hf download mistralai/Mistral-Small-4-119B-2603-NVFP4
The model weights are hosted on Hugging Face.
Critical: The NVFP4 release omits config.json and tokenizer.json for ARM64. Copy them from the base model repo into your weights directory before starting SGLang. Without them, the server fails with KeyError: 'tokenizer.json'.
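A sketch of that copy step, assuming the base (non-NVFP4) repo is mistralai/Mistral-Small-4-119B-2603 and the weights live under /ai/models/mistral-small-4-nvfp4 as in the Docker command below:
export HF_HUB_DISABLE_XET=1
hf download mistralai/Mistral-Small-4-119B-2603 config.json tokenizer.json \
  --local-dir /ai/models/mistral-small-4-nvfp4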
The Working Docker Command
This is the exact command that starts the server without crashing. Every flag matters:
docker run -d \
-e SGLANG_ENABLE_SPEC_V2=True \
--name sglang-mistral4 \
--gpus all \
--network host \
--ipc host \
--restart unless-stopped \
-v /ai/models/mistral-small-4-nvfp4:/model \
-v /ai/models/models--mistralai--Mistral-Small-4-119B-2603-eagle/snapshots/3ff299733b3dcb701617a22add5ce796304f7f05:/eagle \
lmsysorg/sglang:nightly-dev-cu13-20260323-999bad5a \
python -m sglang.launch_server \
--model-path /model \
--tokenizer-path /model \
--host 0.0.0.0 \
--port 30000 \
--attention-backend triton \
--moe-runner-backend flashinfer_cutlass \
--mem-fraction-static 0.75 \
--context-length 65536 \
--cuda-graph-max-bs 32 \
--max-running-requests 16 \
--speculative-algorithm EAGLE \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-num-steps 3
Flag notes:
- `-e SGLANG_ENABLE_SPEC_V2=True` activates the second-generation speculative decoding kernel. It has to be passed into the container with `-e`; setting it only in the host shell never reaches the server. Don't omit it — EAGLE acceptance rates are measurably lower without it.
- `--attention-backend triton` is required on GB10. The default flashinfer backend crashes immediately on SM 12.1 (not yet supported in the current nightly).
- `--moe-runner-backend flashinfer_cutlass` runs the MoE routing layer through FlashInfer's cutlass kernel. It is separate from the attention backend: it doesn't crash on SM 12.1 and gives a measurable throughput improvement on Mistral's MoE layers versus the triton fallback.
- `--cuda-graph-max-bs 32` sets the maximum batch size for CUDA graph capture. Above 32 on GB10, the server OOMs during warmup before a single request is processed.
- `--max-running-requests 16` caps concurrent in-flight requests. Without it, the server accepts unbounded concurrent load and doesn't shed requests before OOM conditions develop.
- `--restart unless-stopped` and `--rm` are mutually exclusive Docker flags. Don't combine them; Docker treats the pair as conflicting options and refuses to start the container.
Note: --skip-server-warmup is not used. It speeds up startup but leaves CUDA graphs uninitialized, which causes high latency on the first real request batch. The 30-second warmup pays for itself in consistent throughput.
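To avoid sending real traffic before the server is ready, poll the health endpoint until it answers (a hedged sketch; recent SGLang builds expose /health, and /health_generate additionally runs a tiny generation):
until curl -sf http://127.0.0.1:30000/health_generate > /dev/null; do
  echo "waiting for SGLang to come up..."
  sleep 5
done
echo "SGLang is serving"
# or watch startup directly: docker logs -f sglang-mistral4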
Real Numbers
On the DGX Spark with Mistral Small 4 NVFP4 + EAGLE:
| Prompt type | Context tokens | Output speed |
|---|---|---|
| Short (summary) | ~120 | 37 tok/s |
| Medium (analysis) | ~800 | 25 tok/s |
| Long / code | ~1400 | 35–41 tok/s |
| Average | | ~35 tok/s |
| Metric | Value |
|---|---|
| Baseline without EAGLE | ~12–15 tok/s |
| EAGLE acceptance rate | 2.5–3.4x |
| First request latency | ~30s (warmup) |
| RAM during active inference | ~94 GB |
| Minimum free RAM to start | 70 GB |
EAGLE speculative decoding runs a small draft model alongside the target model to propose tokens; the acceptance rate measures how many of those proposals the target model confirms. At 2.5–3.4x, roughly 2–3 tokens are accepted per decoding step instead of one, which is consistent with the jump from the ~12–15 tok/s baseline to the 35–41 tok/s measured with EAGLE. See the EAGLE repository for implementation details.
Memory Management After Stopping
After docker kill sglang-mistral4, unified memory does not release immediately. Wait 30–120 seconds before restarting. Check with:
free -h
A helper script automates this guard — poll the available-memory column of free until it exceeds 70 GB, or give up after a timeout:
#!/bin/bash
TIMEOUT=300
INTERVAL=5
elapsed=0
while [ $elapsed -lt $TIMEOUT ]; do
free_gb=$(free -g | awk '/^Mem:/{print $7}')
[ "$free_gb" -ge 70 ] && echo "Ready: ${free_gb}GB free" && exit 0
echo "Waiting... ${free_gb}GB free (${elapsed}s elapsed)"
sleep $INTERVAL
elapsed=$((elapsed + INTERVAL))
done
echo "Timeout: memory did not clear in ${TIMEOUT}s" && exit 1
Warning: starting a new container before memory clears doesn’t fail immediately. The weights load, the server reports ready, and then the first inference request OOM-kills the process. The failure looks like a successful launch. You won’t see the problem until a client tries to use the endpoint.
Warning: docker restart sglang-mistral4 doesn’t help here. It releases and immediately re-acquires unified memory without the hold period. Use docker kill, wait for free -h to show at least 70 GB available, then docker run fresh.
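Put together, a safe restart looks roughly like this (a sketch; wait-for-ram.sh is the guard script above saved to disk, and the last line stands in for the full docker run command from earlier):
docker kill sglang-mistral4
docker rm sglang-mistral4
./wait-for-ram.sh                            # blocks until >= 70 GB is available
docker run -d --name sglang-mistral4 ...     # the full working command above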
The Sequential-Services Rule
SGLang holds ~94 GB of the 128 GB unified pool during inference. Voxtral TTS holds ~111 GB while loaded. ComfyUI with FLUX.1-schnell holds ~14 GB. All three trying to share the pool: OOM, every time.
Run one at a time. The pattern that works:
| Workflow | Order |
|---|---|
| Article generation | SGLang up → write articles → SGLang stays |
| Podcast generation | SGLang stop → wait 60s → Voxtral up → generate audio → Voxtral stop |
| Hero image generation | SGLang stop → wait 60s → ComfyUI up → generate → ComfyUI stop → SGLang up |
A single dashboard with start/stop controls and a 60-second guard between transitions removes the manual coordination cost. The same free -h ≥ 70 GB check belongs in the start handler.
Per-Service systemd with Targeted NOPASSWD
The dashboard needs to start, stop, and restart these services without a password prompt. The temptation is a wildcard like cipherfox ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart *. Don’t.
Per-service entries scope the privilege to exactly what’s needed:
# /etc/sudoers.d/sglang
cipherfox ALL=(ALL) NOPASSWD: /usr/bin/systemctl start sglang-mistral4
cipherfox ALL=(ALL) NOPASSWD: /usr/bin/systemctl stop sglang-mistral4
cipherfox ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart sglang-mistral4
A wildcard restart * lets a compromised dashboard restart any system service — sshd, networking, the firewall. Per-service entries make it impossible to escalate from a single API endpoint into a sudo-restartable system service that wasn’t on the original list.
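The sudoers entries assume a sglang-mistral4 systemd unit exists. A minimal sketch of one, with hypothetical script paths, wrapping the docker run command and the memory guard from earlier:
# /etc/systemd/system/sglang-mistral4.service (sketch)
[Unit]
Description=SGLang server for Mistral Small 4 NVFP4
After=docker.service
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
# refuse to start until >= 70 GB of unified memory is available
ExecStartPre=/ai/scripts/wait-for-ram.sh
# start-sglang.sh wraps the docker run command shown earlier
ExecStart=/ai/scripts/start-sglang.sh
ExecStop=/usr/bin/docker kill sglang-mistral4
ExecStopPost=-/usr/bin/docker rm sglang-mistral4

[Install]
WantedBy=multi-user.target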
What Crashes (And Why)
| Flag / Image | Why It Fails |
|---|---|
| `--attention-backend flashinfer` | SM 12.1 not supported, instant crash |
| `--mem-fraction-static 0.88` | OOM during initialization, tested and confirmed |
| `--mem-fraction-static` > 0.85 (any) | OOM at startup, not during inference |
| `--cuda-graph-max-bs` > 32 | OOM during warmup before first request |
| `--speculative-eagle-topk 4` | Wrong value, correct is 1 |
| `--rm` + `--restart` | Mutually exclusive Docker flags |
| SGLang stable image | No SM 12.1 support |
| Restart without 60s wait | Memory not released, OOM on first inference |
| Voxtral while SGLang runs | 111 + 94 > 128, instant OOM |
Verify the Endpoint
curl -s http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Mistral-Small-4","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
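If that returns an error instead of text, first confirm the model is registered at all; the OpenAI-compatible models endpoint is the quickest check:
curl -s http://127.0.0.1:30000/v1/models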
If you see BadRequestError: Alternating roles required from a client like OpenHands, that is a separate issue with how the client formats messages. The fix is to set enable_prompt_extensions=false in the OpenHands config.
reasoning_effort and Reporting Quirks
Two known SGLang quirks on this build that will confuse you if you don’t expect them.
reasoning_tokens always reports 0. SGLang’s response metadata shows reasoning_tokens: 0 even when reasoning is active. The model is reasoning — reasoning_content is populated correctly in the response body. It is a reporting bug in SGLang, not a model configuration error.
Only “high” and “none” work for reasoning_effort. Values like "low" or "medium" are silently ignored and the server defaults to no reasoning. If you need reasoning, set "high". If you don’t want it, set "none". There is no middle ground on this nightly build.
# Working
curl ... -d '{"reasoning_effort": "high", ...}'
curl ... -d '{"reasoning_effort": "none", ...}'
# Silently ignored — no reasoning, no error
curl ... -d '{"reasoning_effort": "low", ...}'
Self-Diagnosis Tooling
A common failure pattern: SGLang crashes, the operator pastes the error to a chat assistant, the assistant doesn’t know about GB10 quirks and suggests --attention-backend flashinfer. The cycle repeats.
The fix that actually scaled: a small MCP server exposing a diagnose_sglang tool with the GB10/SM121A rules embedded. Local AI agents (OpenClaw, Vibe) call it directly when SGLang misbehaves and get the right answer without the operator re-explaining the hardware every time. The same MCP server also exposes a blog-search tool so the agents can find the original setup article instead of inventing flag combinations.
The point isn’t the specific implementation. It’s that brittle setups deserve diagnostic tooling co-located with the system, not buried in a chat history.
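A stripped-down version of the same idea, no MCP required, is a script that encodes the rules from this article and can be run by hand or called by an agent whenever the server misbehaves. A hedged sketch, with hypothetical paths:
#!/bin/bash
# gb10-diagnose.sh: check the usual GB10/SGLang failure causes (sketch)
avail_gb=$(free -g | awk '/^Mem:/{print $7}')
[ "$avail_gb" -lt 70 ] && \
  echo "FAIL: only ${avail_gb} GB available; wait for unified memory to clear"

args=$(docker inspect --format '{{join .Args " "}}' sglang-mistral4 2>/dev/null)
echo "$args" | grep -q 'attention-backend flashinfer' && \
  echo "FAIL: flashinfer attention backend crashes on SM 12.1; use triton"
echo "$args" | grep -q 'attention-backend triton' || \
  echo "WARN: --attention-backend triton not found in container args"

for f in config.json tokenizer.json; do
  [ -f "/ai/models/mistral-small-4-nvfp4/$f" ] || \
    echo "FAIL: $f missing from the weights dir; copy it from the base repo"
done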
One wrong flag and the container exits silently. One missing tokenizer file and the server never starts. The setup is narrow but reproducible. Once it runs, you get a full 119B MoE model on local hardware at zero cloud cost, with consistent throughput that doesn’t degrade under repeated use.
Note: --network host exposes port 30000 on all interfaces. Avoid this on shared or semi-public machines. Bind to --host 127.0.0.1 instead and use a reverse proxy if you need external access from other devices on the network. Do not expose the SGLang port directly to the internet without authentication — it accepts any request without credentials by default.
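For example, binding to loopback and tunnelling over SSH from another machine is a low-effort alternative to a reverse proxy (a sketch; hostname and username are placeholders):
# in the docker command, replace --host 0.0.0.0 with:
--host 127.0.0.1 \

# from another machine on the LAN:
ssh -N -L 30000:127.0.0.1:30000 user@dgx-spark
# then call http://127.0.0.1:30000 on that machine as usual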
What I Actually Use
- DGX Spark GB10: the only consumer-grade machine I’ve found where a 119B MoE model runs at useful speed without a data center
- SGLang nightly (cu13): the stable release simply doesn’t work on SM 12.1, nightly is the only option
- EAGLE speculative decoding: 2.5–3.4x throughput gain with no quality loss, worth the extra model file
- One-service-at-a-time discipline: the rule that keeps the OOM-killer asleep