SGLang on DGX Spark
--attention-backend triton is the only backend that works on GB10
We started with the default flashinfer backend because it’s what SGLang ships with. The docs say “flashinfer is the fastest attention backend for CUDA cards.” That’s true for Ampere and Hopper, but GB10 uses SM121a, which flashinfer doesn’t support. The moment we tried to load a batch, the process either died with CUDA errors or silently OOM’d after a few hundred tokens.
python -m sglang.launch_server \
--model-path mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4 \
--attention-backend flashinfer
# [E 2024-05-XX 12:34:56.789 Server] CUDA error: invalid device function
# [E 2024-05-XX 12:34:56.789 Server] OOM when allocating tensor
Watch out: The error message “invalid device function” or a silent OOM on GB10 indicates flashinfer incompatibility. This isn’t just a performance issue; flashinfer simply has no kernels for the GB10’s SM121a architecture. The SGLang GitHub issue #18203 documents similar crashes where the kernel fails to load on DGX Spark systems.
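If you want to confirm what you’re running on before choosing a backend, nvidia-smi can report the compute capability directly. A minimal check; the device name and “12.1” value in the comment are what we’d expect for GB10, not a captured transcript:

# Check the GPU's compute capability before picking an attention backend
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# Expected on DGX Spark: NVIDIA GB10, 12.1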
Switching to triton fixed it immediately:
python -m sglang.launch_server \
--model-path mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4 \
--attention-backend triton
# Server starts cleanly, no OOM, no CUDA errors
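A quick probe of SGLang’s built-in endpoints confirms the server is actually answering requests (assuming the default port 30000):

# Smoke test once the server reports ready
curl -s http://localhost:30000/health          # returns 200 when the server is live
curl -s http://localhost:30000/get_model_info  # echoes the loaded model path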
Triton works because its kernels are JIT-compiled for whatever GPU is present rather than shipped as prebuilt, architecture-specific CUDA kernels. However, be aware that triton is generally slower than flashinfer on Ampere/Hopper GPUs; that’s the tradeoff for stability on GB10. If you’re targeting DGX Spark exclusively, triton is your only viable option for now.
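If the same launch script has to cover both DGX Spark and older Hopper/Ampere machines, a small wrapper can pick the backend from the reported compute capability. A sketch, assuming GB10 reports 12.1 as above:

# Pick the attention backend per machine (sketch)
CAP=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
if [ "$CAP" = "12.1" ]; then
    BACKEND=triton      # GB10: flashinfer kernels are unavailable
else
    BACKEND=flashinfer  # Ampere/Hopper: keep the faster default
fi
python -m sglang.launch_server \
    --model-path mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4 \
    --attention-backend "$BACKEND"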
EAGLE speculative decoding delivered 2.5x throughput
Without speculative decoding, the base model gave us 12–15 tok/s on a 119B. That’s usable for short prompts, but for code refactoring or long analysis we needed more.
# Enable EAGLE with the 24B model as the draft
python -m sglang.launch_server \
--model-path mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4 \
--attention-backend triton \
--speculative-algorithm EAGLE \
--speculative-draft-model-path mistralai/Mistral-Small-4-24B-Instruct-4.0-NVFP4 \
--speculative-draft-accept-rate-threshold 0.3
The end-to-end speedup hovered between 2.5× and 3.4× depending on prompt length. For a 1,400-token code refactor we measured 35–41 tok/s. For a 120-token summary we hit 37 tok/s. That’s real interactive speed, not synthetic benchmarks.
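A quick way to reproduce this kind of number is to time an ordinary request against the OpenAI-compatible endpoint and divide completion_tokens by the wall-clock time; the prompt and token budget below are placeholders:

# Rough tok/s check against the running server
time curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4",
       "messages": [{"role": "user", "content": "Refactor the following function ..."}],
       "max_tokens": 512}' \
  | python3 -m json.tool | grep completion_tokens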
Watch out: EAGLE’s performance gains depend heavily on draft model quality. If your draft model is too large (e.g., approaching the size of the target model), you’ll see diminishing returns due to memory bandwidth saturation. In our tests, a 24B draft worked well with the 119B target, but a 48B draft caused throughput to drop below baseline levels. The NVIDIA DGX Spark forum has reports of similar issues when draft models exceed 30% of the target model size.
Another gotcha: EAGLE’s speculative decoding can introduce latency spikes when the draft model rejects many tokens. We observed occasional 200ms delays during high-rejection phases, which can disrupt real-time applications. Monitor your accept rate closely—if it drops below 0.2, consider reducing the draft model size or switching to a more conservative threshold.
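SGLang can expose its speculative-decoding counters over Prometheus when you launch with --enable-metrics; the exact metric names vary between versions, so grep broadly:

# Inspect spec-decode metrics (server launched with --enable-metrics)
curl -s http://localhost:30000/metrics | grep -i spec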
Vibe CLI startup dropped from 8.7s to 1.5s by removing one MCP server
Vibe 2.7.2 started every session with an 8.7-second delay. That’s not “a little slow,” that’s “I’ll go get coffee” slow. Profiling the MCP server logs showed the culprit:
[MCP] alby: starting npx -y @getalby/mcp ...
[MCP] alby: ready after 7.2s
Even when we never used Alby payments, Vibe still spawned the MCP server, downloaded npm packages, and initialized the extension. That’s seven seconds of pure waste.
The fix was removing one section from ~/.vibe/config.toml:
# Before:
[mcp.alby]
command = "npx"
args = ["-y", "@getalby/mcp"]
# After:
# [mcp.alby] # removed
Result: startup time collapsed to 1.5 seconds. That’s a 7.2-second saving every time you open Vibe.
Gotcha: if you actually use Alby payments in Vibe, don’t remove this section. But if you only use Lightning via the browser extension, you can safely delete it. Note that some Vibe plugins may implicitly depend on Alby’s MCP server—test thoroughly after removal. The Vibe CLI GitHub repository documents several cases where MCP servers interfere with core functionality.
Watch out: Disabling Alby’s MCP server may break other integrations that rely on its payment APIs. If you’re using Vibe for financial workflows, consider keeping the server but optimizing its startup sequence. Some users report success by pre-installing the Alby MCP package via npm install -g @getalby/mcp to avoid runtime downloads.
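If you do keep Alby, the idea behind the pre-install trick is to fetch the package once so npx resolves the already-installed binary instead of downloading it on every launch. A sketch (assuming the package exposes its server through the usual npm bin field; we haven’t measured how much of the 7.2s this recovers):

# One-time install, then drop the "-y" auto-install flag from the config
npm install -g @getalby/mcp

# ~/.vibe/config.toml
[mcp.alby]
command = "npx"
args = ["@getalby/mcp"]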
What I Actually Use
- SGLang nightly-dev-cu13: the nightly build runs reliably on GB10; stable releases crash
- Mistral Small 4 119B NVFP4: runs with a 65,536-token context (see the combined launch command at the end of this section)
- Vibe CLI 2.7.2: with Alby MCP disabled and only Gitea + monitoring MCP enabled
Additional notes:
- The nightly-dev-cu13 build is required because stable releases lack GB10 support. The SGLang DGX Spark setup guide confirms this dependency.
- For the 119B model, we use the NVFP4 quantization to balance memory usage and performance. FP8 variants may offer better throughput but require careful calibration.
- Vibe’s MCP ecosystem is powerful but fragile—disable servers incrementally to identify conflicts. The DGX Spark user forum has threads discussing MCP server compatibility issues.
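Putting those pieces together, the launch command ends up looking roughly like this (a sketch assembled from the settings above; drop the EAGLE flags if you’re not running speculative decoding):

# Combined launch command for the setup above (sketch)
python -m sglang.launch_server \
    --model-path mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4 \
    --attention-backend triton \
    --context-length 65536 \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path mistralai/Mistral-Small-4-24B-Instruct-4.0-NVFP4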