Five DGX Spark Disasters I Survived (You Don't Have To)

Five operational failures from the first two months on a DGX Spark, each with the actual fix and the postmortem link. The pattern is identical across all five: a quiet default value, a non-obvious failure mode, several hours of confused debugging, then a one-line workaround that should have been in the documentation. Read the list before you buy the hardware. Most of the disasters are avoidable if you know what to look for.

Quick Take

  • Disaster 1: Kernel page cache held stale model weights; the next launch OOMed at roughly 95 GB of allocation even though the model should have fit easily in 128 GB. Fix: sync, then echo 3 > /proc/sys/vm/drop_caches (as root) before every model swap.
  • Disaster 2: vLLM MoE backend defaulted to a kernel path that froze the entire window manager while inference appeared to run. Fix: VLLM_FLASHINFER_MOE_BACKEND=latency on service start.
  • Disaster 3: Hugging Face download silently truncated at 22 GB on a 60 GB model checkpoint, with no error message. Fix: SHA verification on every model file plus retry with explicit byte ranges.
  • Disaster 4: Mistral on SGLang threw BadRequestError on every multi-turn agent call because the strict-alternation tokenizer disagreed with the OpenAI-compatible schema. Fix: side-car proxy that rewrites the message sequence.
  • Disaster 5: EAGLE speculative decoding looked like free throughput in benchmarks; on structured-JSON output it collapsed throughput to 13-25 tok/s. Fix: disable EAGLE per-workload via dispatcher.
  • The pattern: every disaster was a default that was wrong for the most common workload. Build the runbook before you need it.

Disaster 1: the page cache hijack

The first crash that taught me to write a runbook.

Symptom. vLLM on the Spark refused to start after a previous Qwen 3.6 session had crashed. The model fit in roughly 22 GB on disk and the Spark has 128 GB of unified memory, so an OOM should have been impossible. The error was the standard “out of memory” message, triggered at roughly 95 GB of allocation. Restarting the inference engine did not help. Rebooting the host did help, which was the panic-button fix.

Root cause. The Linux kernel page cache had retained the previous session’s model weights. The kernel does not release the cache on process exit, because cached pages are meant to be opportunistic memory that gets reclaimed under pressure. On the Spark’s unified-memory architecture, CPU and GPU share one physical pool, and the cache reclamation path does not interact cleanly with the inference engine’s allocation pattern. The next launch sees roughly 95 GB already “allocated” (most of it stale weights) and refuses to start.

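Fix. Run sync, then echo 3 > /proc/sys/vm/drop_caches as root before every model swap, via a small wrapper script that runs the command automatically on inference-engine restart. From a non-root shell, sudo echo 3 > /proc/sys/vm/drop_caches fails because the redirect is performed by the unprivileged shell; echo 3 | sudo tee /proc/sys/vm/drop_caches works. The line is in the runbook now, and the runbook is the first thing any new operator on this hardware reads.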
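A minimal sketch of what such a wrapper could look like, not the script from the postmortem. The function names are hypothetical, and the code assumes it runs as root on Linux (writing to /proc/sys/vm/drop_caches fails otherwise). The /proc/meminfo parser is there so the wrapper can log how much memory the drop actually freed.

```python
import os
from pathlib import Path

DROP_CACHES = Path("/proc/sys/vm/drop_caches")
MEMINFO = Path("/proc/meminfo")


def meminfo_kb(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def drop_page_cache(target: Path = DROP_CACHES) -> None:
    """Flush dirty pages to disk, then ask the kernel to drop clean caches.

    Writing "3" drops the page cache plus dentries and inodes.
    Requires root when target is the real /proc file.
    """
    os.sync()
    target.write_text("3\n")


def available_gb(meminfo_path: Path = MEMINFO) -> float:
    """Return MemAvailable in GB, as reported by the kernel."""
    return meminfo_kb(meminfo_path.read_text())["MemAvailable"] / (1024 * 1024)
```

A restart hook would call available_gb(), then drop_page_cache(), then available_gb() again, and log both numbers before launching the engine. If the second number is not clearly higher, the page cache was not the problem.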

The full postmortem is at Fixes: SGLang Restart OOM Fix. Cost to me: roughly six hours of confused debugging before the page-cache hypothesis surfaced. Cost to the next operator who reads this article: zero.

Disaster 2: the silent desktop freeze

The crash that taught me to operate the Spark headless.

Symptom. A long inference session on Qwen 3.6 PrismaQuant produced perfectly reasonable token output until the GNOME desktop session froze entirely, while the inference endpoint continued returning tokens to API clients. The mouse stopped moving. The keyboard stopped responding. The model continued generating. I had to SSH in from a laptop on the same Tailscale mesh to debug, because the local console was unusable.

Root cause. The vLLM MoE backend has several kernel selection paths. The default path for sm121 (the Spark’s CUDA compute capability) routes the FlashInfer MoE kernels through a code path that contends with the display server’s memory allocator. Inference continues because the inference path runs on the GPU; the display freezes because the CPU side of the contention is starved.

Fix. Set VLLM_FLASHINFER_MOE_BACKEND=latency in the systemd unit’s Environment= line before the inference service starts. The “latency” backend path uses a different kernel selection that avoids the contention. The default is “throughput,” which is wrong for interactive single-stream workloads on this hardware tier.
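One way to apply the setting without editing the main unit file is a systemd drop-in. The sketch below renders and writes one; the unit name vllm.service and the override.conf filename are assumptions for illustration, so substitute whatever your service is actually called.

```python
from pathlib import Path


def render_dropin(env: dict[str, str]) -> str:
    """Render a systemd drop-in that sets environment variables for a service."""
    lines = ["[Service]"]
    for key, value in sorted(env.items()):
        # Quoting the whole assignment is valid Environment= syntax.
        lines.append(f'Environment="{key}={value}"')
    return "\n".join(lines) + "\n"


def write_dropin(
    unit: str,
    env: dict[str, str],
    root: Path = Path("/etc/systemd/system"),
) -> Path:
    """Write <root>/<unit>.d/override.conf and return its path.

    `unit` is a hypothetical name such as "vllm.service".
    Writing under /etc/systemd/system requires root.
    """
    dropin_dir = root / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / "override.conf"
    path.write_text(render_dropin(env))
    return path
```

After writing the file, systemctl daemon-reload followed by a restart of the unit picks up the new environment. Calling write_dropin("vllm.service", {"VLLM_FLASHINFER_MOE_BACKEND": "latency"}) produces a two-line override containing the [Service] header and the Environment= line.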

The full postmortem is at Fixes: vLLM MoE Throughput sm121 Desktop Freeze. The fact that the wrong default ships in a stable release is partly forgivable (Blackwell is a new platform) and partly not (the flag’s existence is the kind of detail that should be in the release notes). Reading the NVIDIA developer forum thread on the vLLM 0.17 MXFP4 patches gave me the context to ask the right question and find the flag.

Disaster 3: the silently truncated model download

The most infuriating disaster, because the failure was invisible.

Symptom. Mistral Small 4 NVFP4 (~60 GB on disk) downloaded from Hugging Face, and vLLM refused to load it with a vague “tensor shape mismatch” error. The downloaded file looked correct at the directory listing level. Re-downloading produced the same error. Re-downloading a third time and verifying the SHA against the upstream checksum revealed that the file was 22 GB on disk, not 60 GB, despite the hf-cli having reported “Download complete.”

Root cause. Hugging Face’s CLI silently truncated the download at the 22 GB mark, with no error message and no nonzero exit code. The file system showed the file as written. The download tool’s progress bar reached 100 percent before hanging. The download path between my house and Hugging Face’s CDN had an intermittent issue at the 22 GB boundary, possibly tied to my upstream ISP’s connection handling, possibly tied to a CDN node that was rejecting long-running streaming downloads. The exact cause was never fully diagnosed, because the fix made the diagnosis unnecessary.

Fix. SHA verification on every model file on every download, plus retry with explicit byte-range resume on any mismatch. The download wrapper script now does the verification on every file in the model directory, and a single SHA mismatch triggers an automatic re-download of just the affected shard. The fix took roughly two hours to write. The original disaster cost me a day of confused debugging before the truncation hypothesis surfaced.
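The verification half of that wrapper can be sketched in a few lines. This is an illustration rather than the actual script: the expected mapping of filename to SHA-256 hex digest is assumed to come from the upstream checksums, and how you obtain it is left out. The resume_header helper only builds a standard HTTP Range header; it does not perform the download.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB shards never sit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_shards(model_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return the filenames that are missing or whose hash does not match.

    `expected` maps filename -> SHA-256 hex digest (assumed to be supplied
    by the caller from the upstream checksums).
    """
    bad = []
    for name, want in sorted(expected.items()):
        path = model_dir / name
        if not path.is_file() or sha256_of(path) != want.lower():
            bad.append(name)
    return bad


def resume_header(path: Path) -> dict[str, str]:
    """Build a Range header requesting every byte after what is on disk."""
    offset = path.stat().st_size if path.exists() else 0
    return {"Range": f"bytes={offset}-"}
```

Resuming from the current file size is only safe if the bytes already on disk are intact. If a resumed shard still fails the hash check, delete it and fetch it from byte zero.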

The full postmortem is at Fixes: HF Download Lies at 22GB. The lesson is that trust-but-verify applies even to first-party tooling from major platforms. The platform was not trying to lie; the failure mode was real, and the verification step was the load-bearing discipline.

Disaster 4: the alternating-roles bug

The crash that became a permanent piece of the production stack.

Symptom. Every multi-turn agent call against Mistral Small 4 on SGLang threw BadRequestError: Strict alternation violated. The single-turn calls worked. The multi-turn calls did not. The behavior was consistent across coding assistants (Vibe, OpenClaw, opencode), across OpenAI-compatible clients, across multiple SGLang versions. Restarting the service did not help. Adjusting the system prompt did not help.

Root cause. Mistral’s tokenizer applies strict role-alternation: the message sequence must be user, assistant, user, assistant, with no two consecutive same-role messages. The OpenAI-compatible API schema does not enforce this, and many client implementations inject a second user message (for example, a “session title generation” turn, or a tool-result message coded as a user role) that violates the alternation. SGLang on Mistral surfaces this as a 400 error rather than gracefully merging or reformatting the messages.

Fix. A side-car proxy that sits between the OpenAI-compatible client and SGLang, rewrites the message sequence to satisfy strict alternation (merging consecutive same-role messages or inserting a token to break them), and forwards the cleaned request to SGLang. The proxy is OpenClaw, which became a standalone tool because the fix has been useful enough across multiple coding-assistant clients that it was worth packaging.
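The merging strategy is the simpler of the two rewrites to illustrate. The function below is a sketch of the idea, not the proxy's actual code: it assumes every message is a dict with a string role and a string content, and it does not handle tool-role messages or list-style content parts, which a real proxy would have to deal with.

```python
from typing import TypedDict


class Message(TypedDict):
    role: str
    content: str


def enforce_alternation(
    messages: list[Message], separator: str = "\n\n"
) -> list[Message]:
    """Merge consecutive same-role messages so roles strictly alternate.

    Returns new dicts; the input list and its messages are not modified.
    """
    merged: list[Message] = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            # Same role twice in a row: fold this message into the previous one.
            merged[-1] = {
                "role": message["role"],
                "content": merged[-1]["content"] + separator + message["content"],
            }
        else:
            merged.append({"role": message["role"], "content": message["content"]})
    return merged
```

Given a request where a client has injected a second user turn (say, a title-generation prompt right after the real question), the two user messages come out as one, joined by a blank line, and the sequence that reaches the engine alternates cleanly.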

The full postmortem is at Fixes: OpenClaw Mistral Alternating Roles, with the broader OpenClaw setup at Setup: OpenClaw Setup and a sibling case for OpenHands at Fixes: OpenHands BadRequest Fix. The fix has held in production for two months across multiple SGLang version updates.

Disaster 5: the speculative-decoding pessimization

The disaster that was hardest to recognize as a disaster, because the symptom was “slower than expected” rather than “broken.”

Symptom. EAGLE speculative decoding on Mistral Small 4 boosted throughput on conversational workloads from ~12 tok/s baseline to ~35 tok/s average. On structured-JSON output (tool calls, agent responses with strict schemas), the same configuration delivered 13-25 tok/s, roughly the baseline plus noise. The structured workload was meaningfully slower with EAGLE on than the same workload with EAGLE off.

Root cause. EAGLE’s draft model produces token distributions trained on free-form text. Structured-JSON output has a much narrower token distribution: brace, quote, colon, value, comma, repeat. The draft model’s predictions are wrong more often than they are right on this narrow distribution, and the verifier-pass overhead exceeds the savings from accepted speculative tokens. The technique is net-negative on structured workloads.
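The break-even logic can be made concrete with the textbook cost model for speculative decoding. This is a simplification under stated assumptions, not a model of EAGLE's internals: each drafted token is accepted independently with probability acceptance, the draft proposes draft_len tokens per step, and one draft token costs draft_cost times one target-model forward pass. The numbers in the usage note are illustrative, not measurements from the Spark.

```python
def expected_tokens_per_step(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Sum of acceptance**i for i in 0..draft_len: the verifier always
    contributes one token, plus however many drafted tokens survive.
    """
    if acceptance >= 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


def speculative_speedup(acceptance: float, draft_len: int, draft_cost: float) -> float:
    """Throughput relative to plain decoding; below 1.0 means net-negative."""
    step_cost = 1.0 + draft_len * draft_cost  # one verify pass + drafting
    return expected_tokens_per_step(acceptance, draft_len) / step_cost
```

With draft_len=4 and draft_cost=0.2, an acceptance rate of 0.8 gives a speedup of roughly 1.9x, while an acceptance rate of 0.3 gives roughly 0.8x: the same configuration goes from a clear win to a loss purely because the draft is wrong more often.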

Fix. A dispatcher in front of the inference engine reads a per-call workload classification (code, creative, structured, vision) and disables EAGLE for the structured and code classes. The fix is implementation work, not a flag change. The result is that EAGLE accelerates the workloads where it actually helps, and gets out of the way on the workloads where it does not.
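The routing decision itself is small; the implementation work is in everything around it. The sketch below shows only the decision. It assumes one possible arrangement, two OpenAI-compatible endpoints with and without EAGLE, and both URLs are hypothetical placeholders. How EAGLE is actually switched off per call depends on your engine and is not shown.

```python
# Hypothetical endpoints: one engine instance with EAGLE, one without.
EAGLE_URL = "http://127.0.0.1:8001/v1"
PLAIN_URL = "http://127.0.0.1:8002/v1"

KNOWN_CLASSES = frozenset({"code", "creative", "structured", "vision"})
EAGLE_OFF_CLASSES = frozenset({"structured", "code"})


def use_eagle(workload: str) -> bool:
    """Decide whether speculative decoding should be on for this call."""
    if workload not in KNOWN_CLASSES:
        raise ValueError(f"unknown workload class: {workload!r}")
    return workload not in EAGLE_OFF_CLASSES


def pick_backend(workload: str) -> str:
    """Map a per-call workload class to the endpoint that should serve it."""
    return EAGLE_URL if use_eagle(workload) else PLAIN_URL
```

Rejecting unknown classes instead of silently defaulting is deliberate: a misspelled class that quietly falls through to the EAGLE path would reintroduce the slowdown without any visible error.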

The full postmortem is at Fixes: EAGLE Content-Dependent Throughput.

The pattern across the five

Every one of these disasters was a default value or a default behavior that was wrong for the most common workload. The Spark is not unique in this. Every new platform has the same property in its first six to twelve months: the defaults are tuned by the vendor’s QA pipeline on a workload that does not match yours, and the right defaults emerge through the community’s collective postmortem activity.

The right operator posture is to read the postmortems before you adopt the platform. The cost of reading is small. The cost of rediscovering the disasters in production is large.

The right runbook is the runbook that lists the five wrong defaults you have hit and the workarounds for each. Build it as you go. Refer to it on every model swap. Update it when a new disaster surfaces. The runbook is the institutional memory of the operation, and it is the cheapest path to making the operation transferable to a second operator or a future you who has forgotten the details.

Where this fits

For the broader purchase-decision context, see Should You Buy a DGX Spark in 2026?. For the recovery procedure that builds on these postmortems, see Power Failure Recovery on a DGX Spark: The 30-Minute Procedure. For the systemd-unit patterns that operationalize the workarounds, see systemd Patterns for Self-Hosted AI Services.

Book a Stack Audit

If your prospective Spark deployment is likely to hit at least one of the five disasters above, the Stack Audit pre-loads the workarounds into your runbook before you hit them in production. The cost of the audit is small compared to the days of debugging these disasters cost when you encounter them cold.

Reach me through any of the contact links in the footer of this page. Nostr DM is the fastest; the email link is HTML-entity-encoded so it survives spam scrapers.