How we got Mistral Small 4 119B inference working on NVIDIA DGX Spark's ARM64 GB10 chip with SGLang, including backend selection, speculative decoding, and Vibe CLI optimizations.

SGLang on DGX Spark: 35-41 tok/s with EAGLE Speculative Decoding


--attention-backend triton is the only backend that works on GB10

We started with the default flashinfer backend because it’s what SGLang ships with. The docs say “flashinfer is the fastest attention backend for CUDA cards.” That’s true for Ampere and Hopper, but GB10 uses SM121, which flashinfer’s prebuilt kernels don’t cover. The moment we tried to load a batch, the process either died with CUDA errors or silently OOM’d after a few hundred tokens.

python -m sglang.launch_server \
  --model-path mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4 \
  --attention-backend flashinfer
# [E 2026-03-15 12:34:56.789 Server] CUDA error: invalid device function
# [E 2026-03-15 12:34:56.789 Server] OOM when allocating tensor

Watch out: the “invalid device function” error and silent OOM on GB10 both point to flashinfer not having SM121 kernels in its prebuilt library. The closely related SGLang GitHub issue #18203 (“DGX Spark [sgl_kernel] CRITICAL: Could not load any common_ops library!”) tracks a separate but architecturally adjacent problem on the same hardware: sgl_kernel failing to find libnvrtc.so.12 because the GB10’s compute capability 12.1 sits outside the PyTorch-supported range. Different symptom, same root cause family: the GB10 toolchain is still catching up with the silicon.

Switching to triton fixed it immediately:

python -m sglang.launch_server \
  --model-path mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4 \
  --attention-backend triton
# Server starts cleanly, no OOM, no CUDA errors

Triton works because its attention kernels are JIT-compiled for whatever GPU is present, so it doesn’t depend on a prebuilt kernel library that already has to include SM121. The tradeoff is speed: on Ampere and Hopper GPUs, triton is generally slower than flashinfer. If you’re targeting DGX Spark exclusively, triton is your only viable option for now.
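The backend choice can be written down as a small lookup on the GPU's compute capability. This is only a sketch: `pick_attention_backend` and `launch_args` are hypothetical helpers, not part of SGLang, and the lookup assumes compute capability 12.1 is the only case where flashinfer fails, which is simply what we saw on GB10.

```python
# Hypothetical helpers (not part of SGLang): choose an --attention-backend
# value from the GPU's compute capability.
from typing import List, Tuple

# Assumption from our GB10 testing: flashinfer has no prebuilt SM121 kernels.
FLASHINFER_UNSUPPORTED = {(12, 1)}


def pick_attention_backend(capability: Tuple[int, int]) -> str:
    """Return 'triton' where flashinfer is known to fail, else 'flashinfer'."""
    if capability in FLASHINFER_UNSUPPORTED:
        return "triton"
    return "flashinfer"


def launch_args(model_path: str, capability: Tuple[int, int]) -> List[str]:
    """Build the launch command from this article as an argument list."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--attention-backend", pick_attention_backend(capability),
    ]


if __name__ == "__main__":
    model = "mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4"
    print(" ".join(launch_args(model, (12, 1))))
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple this helper expects.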


EAGLE speculative decoding delivered 2.5x throughput

Without speculative decoding, the base model gave us 12 to 15 tok/s on a 119B. That’s usable for short prompts, but for code refactoring or long analysis we needed more.

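# Enable EAGLE with draft model draft-6x8B (the path below is a placeholder)
# Check flag names against `python -m sglang.launch_server --help` for your build
python -m sglang.launch_server \
  --model-path mistralai/Mistral-Small-4-119B-Instruct-4.0-NVFP4 \
  --attention-backend triton \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path mistralai/<smaller-compatible-draft-model> \
  --speculative-draft-accept-rate-threshold 0.3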

The speedup over the baseline ranged from roughly 2.5x to 3.4x depending on prompt length. For a 1,400-token code refactor we measured 35 to 41 tok/s. For a 120-token summary we hit 37 tok/s. These are measurements from interactive sessions, not synthetic benchmarks.
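The 2.5x to 3.4x figures can be cross-checked against the throughput ranges quoted in this section. The sketch below pairs the extremes of the two ranges, so its bounds are wider than any single run would show; `speedup_bounds` is an illustrative helper, not something from our benchmark scripts.

```python
# Sanity check on the reported speedup, using only figures quoted in this section.
BASELINE_TOK_S = (12.0, 15.0)  # without speculative decoding
EAGLE_TOK_S = (35.0, 41.0)     # with EAGLE, 1,400-token refactor prompt


def speedup_bounds(baseline, accelerated):
    """Return (worst_case, best_case) speedup from two (low, high) ranges."""
    base_lo, base_hi = baseline
    acc_lo, acc_hi = accelerated
    worst = acc_lo / base_hi  # slowest EAGLE run against fastest baseline run
    best = acc_hi / base_lo   # fastest EAGLE run against slowest baseline run
    return worst, best


worst, best = speedup_bounds(BASELINE_TOK_S, EAGLE_TOK_S)
print(f"speedup between {worst:.2f}x and {best:.2f}x")
```

The pessimistic bound comes out a little under 2.5x and the optimistic bound at about 3.4x, so the headline figure sits inside the range the raw numbers allow.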

Watch out: EAGLE’s performance gains depend heavily on draft model quality and size relative to the target. If the draft model approaches the size of the target, you see diminishing returns from memory bandwidth saturation. The exact draft model path depends on which compatible NVFP4-quantized smaller variant is available when you build, so check the Mistral and SGLang model-compatibility tables before pinning one. The NVIDIA DGX Spark developer forum has reports of similar issues when draft models exceed roughly 30 percent of the target model size.

Another gotcha: speculative decoding can introduce latency spikes when the target model rejects many of the draft tokens. We observed occasional 200ms delays during high-rejection phases, which can disrupt real-time applications. Monitor your accept rate closely. If it drops below 0.2, consider reducing the draft model size or switching to a more conservative threshold.


Vibe CLI startup dropped from 8.7s to 1.5s by removing one MCP server

Vibe 2.7.2 started every session with an 8.7-second delay. That’s not “a little slow,” that’s “I’ll go get coffee” slow. Profiling the MCP server logs showed the culprit:

[MCP] alby: starting npx -y @getalby/mcp ...
[MCP] alby: ready after 7.2s

Even when we never used Alby payments, Vibe still spawned the MCP server, downloaded npm packages, and initialized the extension. That’s seven seconds of pure waste.
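Finding the slow server by eye works for one entry, but the same check can be scripted. The sketch below assumes every MCP server logs a "ready after N.Ns" line in the format shown above; we only confirmed that format for the Alby entry, and `mcp_startup_times` is a hypothetical helper, not part of Vibe.

```python
import re

# Matches lines such as "[MCP] alby: ready after 7.2s" (format taken from our logs).
READY_LINE = re.compile(
    r"^\[MCP\]\s+(?P<name>[\w-]+):\s+ready after\s+(?P<secs>\d+(?:\.\d+)?)s"
)


def mcp_startup_times(log_text: str) -> dict:
    """Map each MCP server name to its reported startup time in seconds."""
    times = {}
    for line in log_text.splitlines():
        match = READY_LINE.match(line.strip())
        if match:
            times[match.group("name")] = float(match.group("secs"))
    return times


def slowest_first(times: dict) -> list:
    """Return (name, seconds) pairs sorted from slowest to fastest."""
    return sorted(times.items(), key=lambda item: item[1], reverse=True)


sample = """[MCP] alby: starting npx -y @getalby/mcp ...
[MCP] alby: ready after 7.2s"""

for name, secs in slowest_first(mcp_startup_times(sample)):
    print(f"{name}: {secs:.1f}s")
```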

The fix was removing one section from ~/.vibe/config.toml:

# Before:
[mcp.alby]
command = "npx"
args = ["-y", "@getalby/mcp"]

# After:
# [mcp.alby]  # removed

Result: startup time collapsed to 1.5 seconds. That’s a 7.2-second saving every time you open Vibe.

Gotcha: if you actually use Alby payments in Vibe, do not remove this section. If you only use Lightning via the browser extension, you can safely delete it. Some Vibe plugins may implicitly depend on Alby’s MCP server, so test thoroughly after removal. The Mistral Vibe repository on GitHub (Mistral’s open-source CLI coding assistant, Python 3.12+, MCP support, agent profiles like default / plan / accept-edits / auto-approve) is the authoritative reference for the config schema and which MCP servers are safe to disable in your specific install.

Watch out: Disabling Alby’s MCP server may break other integrations that rely on its payment APIs. If you’re using Vibe for financial workflows, consider keeping the server but optimizing its startup sequence. Some users report success by pre-installing the Alby MCP package via npm install -g @getalby/mcp to avoid runtime downloads.

If you do not have an Alby account yet and want one for Lightning payments inside Vibe (or anywhere else), you can sign up via Alby. Honest disclosure: this is one of three affiliate links on the site (Alby, BitBox, FlokiNET), all chosen because they are the no-KYC tools we actually use, not because they pay the most.


Monitoring the EAGLE accept rate in production

Throughput numbers are useless if you cannot tell whether the speculative-decoding accept rate has collapsed. SGLang exposes per-request stats; the cleanest way to notice regressions is a Prometheus scrape against the engine’s metrics endpoint plus a single rolling-window alert on accept rate.

# SGLang exposes /metrics in Prometheus exposition format
curl -s http://localhost:30000/metrics | grep -E "speculative_accept|spec_decoding_accept"

The metric to watch is the rolling fifteen-minute average of accepted draft tokens divided by proposed draft tokens. Healthy on our setup is ~0.30 to ~0.40. A sudden drop to ~0.10 means either the draft model is wrong for the current workload (e.g. you switched task type and the old draft no longer predicts well), or the draft model loaded into a degraded state. Restart the inference server before debugging deeper; the wins are big enough that an hour of degraded throughput is more expensive than a thirty-second restart.
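The ratio described above can be computed from two cumulative counters sampled at a fixed interval. The sketch below is an illustration only: it assumes the server exposes running totals of accepted and proposed draft tokens (the exact metric names vary, so none are hard-coded), and `RollingAcceptRate` and `classify` are hypothetical helpers. The 0.15 cutoff is the alert threshold used in this article.

```python
from collections import deque


class RollingAcceptRate:
    """Accept rate across the last `window` sampling intervals."""

    def __init__(self, window: int):
        # Each sample is an (accepted_total, proposed_total) pair of cumulative
        # counters; window intervals need window + 1 samples.
        self.samples = deque(maxlen=window + 1)

    def add(self, accepted_total: int, proposed_total: int) -> None:
        self.samples.append((accepted_total, proposed_total))

    def rate(self):
        """Accepted / proposed over the window, or None if nothing was proposed."""
        if len(self.samples) < 2:
            return None
        acc_first, prop_first = self.samples[0]
        acc_last, prop_last = self.samples[-1]
        proposed = prop_last - prop_first
        if proposed <= 0:
            return None  # idle server or counter reset after a restart
        return (acc_last - acc_first) / proposed


def classify(rate, alert_below: float = 0.15) -> str:
    """Label a rate as 'unknown', 'degraded', or 'ok'."""
    if rate is None:
        return "unknown"
    return "degraded" if rate < alert_below else "ok"


tracker = RollingAcceptRate(window=15)  # e.g. one sample per minute
for accepted, proposed in [(0, 0), (35, 100), (70, 200)]:
    tracker.add(accepted, proposed)
print(classify(tracker.rate()))
```

Working from counter deltas rather than averaging per-request ratios keeps short requests from skewing the number.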

# Prometheus alert (rough shape, tune thresholds for your hardware)
- alert: SGLangSpeculativeDecodingDegraded
  expr: avg_over_time(spec_decoding_accept_rate[15m]) < 0.15
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: EAGLE accept rate fell below 0.15 on {{ $labels.instance }}
    description: Either the draft model is wrong for current workload or it is in a degraded state.

The five-minute for: window prevents alerts on transient spikes during cold start or model swaps, where the first few hundred tokens often miss before the cache warms up.
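To see why the hold window filters out cold-start dips, here is a simplified model of that pending-then-firing behaviour. It is a sketch of the idea, not Prometheus's actual evaluation loop: `firing_times` is a hypothetical function, and it assumes one evaluation per sample.

```python
def firing_times(samples, threshold=0.15, hold_seconds=300):
    """Return the timestamps at which the alert would be firing.

    samples: iterable of (timestamp_seconds, accept_rate), ordered by time.
    The condition must hold at every evaluation for `hold_seconds` before
    the alert moves from pending to firing; any healthy sample resets it.
    """
    pending_since = None
    firing = []
    for ts, rate in samples:
        if rate < threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: pending
            if ts - pending_since >= hold_seconds:
                firing.append(ts)   # held long enough: firing
        else:
            pending_since = None    # recovered: reset the hold timer
    return firing


# A two-minute dip during warm-up never fires.
transient = [(0, 0.35), (60, 0.10), (120, 0.10), (180, 0.35), (240, 0.35)]
# A sustained collapse fires once it has lasted five minutes.
sustained = [(t, 0.10) for t in range(0, 421, 60)]

print(firing_times(transient))
print(firing_times(sustained))
```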

What I Actually Use

  • SGLang nightly-dev-cu13 on GB10. The stable releases lack the SM121 / compute-capability-12.1 paths the toolchain still needs. The nightly is the only build that boots cleanly on DGX Spark today.
  • Mistral Small 4 119B NVFP4, context length 65,536 tokens. NVFP4 is the right balance of memory footprint and quality for the 128 GB unified memory; FP8 variants exist but need recalibration that has not paid off in our workload.
  • Vibe CLI with Alby MCP disabled and only Gitea + monitoring MCP enabled. The startup-time saving is the visible win; the deeper one is that fewer MCP servers means fewer flaky-connection retries during a session.
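The memory argument behind the NVFP4 choice can be made concrete with rough arithmetic. This is a back-of-the-envelope sketch, not a measurement: it assumes NVFP4 costs about 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-value block), treats every parameter as quantized, and ignores KV cache, activations, and the memory the OS itself uses.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 119e9           # 119B parameters, from the model name
UNIFIED_MEMORY_GB = 128  # DGX Spark unified memory

# Assumed bits per weight; the NVFP4 figure includes block-scale overhead.
FORMATS = [("NVFP4", 4.5), ("FP8", 8.0)]

for name, bits in FORMATS:
    used = weight_footprint_gb(PARAMS, bits)
    left = UNIFIED_MEMORY_GB - used
    print(f"{name}: ~{used:.0f} GB of weights, ~{left:.0f} GB left for everything else")
```

Under these assumptions NVFP4 weights take roughly 67 GB and FP8 roughly 119 GB, so FP8 would leave very little of the 128 GB for a 65,536-token KV cache.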

The nightly-dev-cu13 build is required because the stable releases assume an older PyTorch ABI than DGX Spark’s compute capability 12.1 silicon needs. The SGLang DGX Spark setup thread on the NVIDIA developer forum is the closest thing to canonical guidance for this combination as of writing; expect it to change as official support catches up. The same forum has the GB10 compatibility threads that document MCP-adjacent issues users hit during inference setup.

Stack

SGLang on DGX Spark: technical stack for 119B model inference, top layer first.

  6. Draft Model: smaller compatible variant
  5. Speculative: EAGLE decoding
  4. Attention: Triton backend
  3. Runtime: SGLang server
  2. OS: Linux ARM64
  1. Hardware: DGX Spark ARM64 GB10
