Power Failure Recovery on a DGX Spark: The 30-Minute Procedure

May 20, 2026 10 min read

Thirty minutes if you have rehearsed the procedure once. Two to six hours if you have not. The procedure below assumes a UPS that triggered a graceful shutdown when mains failed, a separate management host on the same network as the Spark, and a working runbook copy on a medium that survives the Spark being unavailable.

This is the procedure I run when the studio loses mains power, the UPS rides out the first sixty seconds, sends the shutdown signal to the Spark via NUT, and the Spark goes down cleanly. Mains comes back, the procedure begins.

Status note (updated 2026-06-09): This runbook was first published on 2026-05-20, before the Cloudflare Tunnel was retired on 2026-05-24. The public edge is now direct Caddy + Let’s Encrypt on the Floki VPS with no Cloudflare in the path, and the MCP server runs as a container on Floki itself rather than on the Spark behind a tunnel. The steps below reflect the current architecture; the tunnel-specific failure modes from the original version are corrected inline rather than silently removed, so the change is visible.

Quick Take

Prerequisite: UPS with NUT integration, a management host on the LAN with SSH access to the Spark, a runbook copy on a printed sheet or USB stick (not just on the Spark itself).

Minute 0-5: verify mains stable, power on the Spark, confirm BIOS POST and Ubuntu boot.

Minute 5-10: SSH from management host, check disk integrity (fsck reports), confirm filesystem clean, drop the page cache before any service starts.

Minute 10-20: start the inference services in dependency order via systemd, verify each one healthy before moving to the next.

Minute 20-30: smoke-test the public endpoint, re-attach the dashboard, verify the MCP server is reachable, post the “we are back” status to the operator channel.

What goes wrong without the runbook: services started out of order, page cache not flushed (OOM at restart), Tailscale identity stale, the dispatcher routed to a backend still loading weights. (The original list ended with “Cloudflare Tunnel re-handshake never completes,” a failure mode that no longer exists since the 2026-05-24 tunnel retirement.)

What “graceful shutdown via UPS” assumes

The thirty-minute timeline only holds if the shutdown was graceful. Graceful means the UPS detected mains loss, sent a low-battery warning to NUT, NUT signalled the Spark and the management host to shut down, both hosts ran their systemd shutdown targets cleanly, and both hosts powered off before the UPS ran out. The procedure for a graceful shutdown recovery is the procedure below.

If the shutdown was not graceful (the UPS ran out before shutdown completed, or the power event was a hard cut with no UPS at all), the recovery is longer. Add roughly one hour for filesystem repair via fsck, another hour for journald recovery, and another hour for re-verifying model files via SHA against upstream checksums. Plan for a two-to-six hour worst case if your UPS strategy is not in place yet.
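The model re-verification step can be scripted with sha256sum. This is a minimal sketch, assuming you keep a manifest in sha256sum format next to the model files, built from the checksums the upstream publisher lists; the function name and the manifest layout are illustrative, not part of my stack.

```shell
# verify_models MANIFEST
# MANIFEST holds one "<sha256>  <relative path>" line per model file
# (the format sha256sum itself writes). Paths are resolved relative to the
# directory the manifest lives in. Exit status is non-zero if any file is
# missing or does not match its recorded hash.
verify_models() {
    manifest=$1
    ( cd "$(dirname "$manifest")" && sha256sum -c "$(basename "$manifest")" )
}
```

Calling verify_models with the manifest path prints one OK or FAILED line per file. On a multi-gigabyte model directory the run is bounded by disk read speed, which is where most of that extra hour goes.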

For the UPS configuration that makes the graceful path possible, see systemd Patterns for Self-Hosted AI Services. For the broader disaster-recovery context, see Strategy: Backup and Disaster Recovery.

Minute 0-5: power on and POST

Mains is back. The Spark’s UPS reports good battery and stable input. Press the Spark’s power button. Confirm the front-panel LED comes on, the cooling fans spin briefly at high RPM and then back down, and the system reaches BIOS POST. Watch for any error LEDs or beep codes.
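"Good battery and stable input" can be read from NUT instead of the UPS front panel. The sketch below assumes the UPS is registered in NUT and reachable with upsc; the 50 percent charge threshold is my assumption, and the UPS name you pass in is whatever your own ups.conf defines.

```shell
# ups_ready STATUS CHARGE [MIN_CHARGE]
# Succeeds when NUT's ups.status contains the OL (on line) flag, does not
# contain LB (low battery), and battery.charge is at least MIN_CHARGE percent.
ups_ready() {
    status=$1
    charge=$2
    min_charge=${3:-50}          # assumed threshold, tune for your UPS runtime
    case " $status " in
        *" LB "*) return 1 ;;    # low-battery flag is set
    esac
    case " $status " in
        *" OL "*) ;;             # running on mains
        *) return 1 ;;           # on battery or unknown
    esac
    # battery.charge may read "87" or "87.0"; compare the integer part
    [ "${charge%%.*}" -ge "$min_charge" ]
}

# check_ups UPSNAME: queries NUT with upsc and applies ups_ready.
check_ups() {
    ups=$1
    status=$(upsc "$ups" ups.status 2>/dev/null) || return 1
    charge=$(upsc "$ups" battery.charge 2>/dev/null) || return 1
    ups_ready "$status" "$charge"
}
```

Run check_ups with a name such as myups@localhost (hypothetical) before pressing the power button. A second outage while the battery is still nearly empty is the case this guards against.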

On the management host (still online if its own UPS held through the event; otherwise boot it first), open an SSH session to the Spark’s IP. The first connection attempt may fail because the Spark has not yet brought up its network stack. Retry every fifteen seconds. The connection succeeds when the Spark has reached the multi-user systemd target and sshd is listening.
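The retry-every-fifteen-seconds loop, with the five-minute abort from the next paragraph built in, can be sketched like this. The host name is whatever you pass in; key-based SSH access is assumed, since BatchMode refuses to fall back to a password prompt.

```shell
# wait_for_ssh HOST [INTERVAL_SECONDS] [DEADLINE_SECONDS]
# Retries a no-op SSH command until it succeeds or the deadline passes.
# The elapsed counter only adds up the sleep intervals, so the real wall-clock
# time is slightly longer than reported.
wait_for_ssh() {
    host=$1
    interval=${2:-15}
    deadline=${3:-300}
    elapsed=0
    while [ "$elapsed" -lt "$deadline" ]; do
        # BatchMode avoids hanging on a password prompt;
        # ConnectTimeout bounds each individual attempt.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "ssh to $host succeeded after about ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "no ssh to $host within ${deadline}s, abort the timeline" >&2
    return 1
}
```

Called with just the Spark's address it uses the fifteen-second interval and five-minute deadline from this section.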

If the connection has not succeeded within five minutes, abort the timeline. The hardware may have a problem. Open the chassis, check for any unusual smell, dust accumulation on the heatsinks, or visible damage to the SSD. Most of the time the issue is benign (a USB key inserted before reboot is pulling the boot order off), but the next debugging step depends on what is actually wrong.

Minute 5-10: filesystem check and page-cache flush

On the SSH session, the first command is dmesg | tail -50 to see what the kernel logged during boot. Look for “EXT4-fs error” or “I/O error” lines. A clean boot has neither.

Next: journalctl -b 0 -p err to see any priority-error entries from the current boot. Most of these will be benign (the inference services failed because the Spark has not yet been told to start them, which is fine). The ones that matter are filesystem-level errors. If you see them, run sudo fsck -n /dev/nvme0n1p1 (read-only, no modifications; substitute the partition that actually holds the affected filesystem) to assess the damage. On a mounted filesystem the -n result is only indicative, so treat a clean report as provisional. If fsck -n reports issues, you are off the thirty-minute timeline and into the longer recovery procedure.
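Because tail -50 only shows the end of the kernel log, a scan of the whole current boot is a useful complement. This sketch greps for the two patterns named above; the function names are mine, and the pattern list is deliberately as narrow as the text, so extend it if your storage stack logs errors differently.

```shell
# fs_errors: reads kernel log text on stdin and prints only the lines that
# match the two patterns called out above. Exit status 0 means at least one
# match (trouble), 1 means none.
fs_errors() {
    grep -E 'EXT4-fs error|I/O error'
}

# boot_is_clean: succeeds when the current boot's kernel messages contain no
# matching lines. Any matches are printed so you can read them.
boot_is_clean() {
    if journalctl -k -b 0 --no-pager | fs_errors; then
        return 1
    fi
    return 0
}
```

A non-zero exit from boot_is_clean is the signal to stop and run the read-only fsck before touching any service.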

If the filesystem is clean, proceed. The single most important command before any service starts is:

sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

This drops the kernel page cache along with cached dentries and inodes. Only clean pages are dropped, so run sync first if anything has been written since boot. The Spark’s page-cache hijack failure mode is real and is the most common cause of “the service won’t start after a reboot” symptoms. (See Fixes: SGLang Restart OOM Fix for the worked example.) Run the command before you start the inference services, not after.
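To see whether the flush actually returned memory, it helps to read MemAvailable before and after. A minimal sketch, assuming it runs as root on the Spark; the function names are illustrative.

```shell
# mem_available_kb [MEMINFO_FILE]: prints the MemAvailable value in kB.
# The optional argument exists only so the parser can be pointed at a copy.
mem_available_kb() {
    awk '/^MemAvailable:/ {print $2}' "${1:-/proc/meminfo}"
}

# flush_page_cache: sync, drop caches, report the change.
# Needs root because it writes to /proc/sys/vm/drop_caches.
flush_page_cache() {
    before=$(mem_available_kb)
    sync                                # write dirty pages out so they become droppable
    echo 3 > /proc/sys/vm/drop_caches   # 3 = page cache plus dentries and inodes
    after=$(mem_available_kb)
    echo "MemAvailable: ${before} kB -> ${after} kB"
}
```

If the second number is still well below what the model needs, something other than page cache is holding the memory and starting the inference engine will not end well.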

Minute 10-20: start services in dependency order

The systemd unit files for the inference stack declare their After= dependencies explicitly. The recommended order is:

network-online.target (waited for, not started)
tailscale.service (mesh networking)
prometheus-node-exporter.service (so the management host can see the box come up)
vllm-qwen36.service (primary inference)
sglang-mistral.service (secondary inference)
dispatcher.service (the router in front of both backends)
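mcp-server.service (the public MCP integration surface in the original architecture; since 2026-05-24 the MCP server runs as a container on the Floki VPS, so this unit is no longer started on the Spark)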
caddy.service (the local reverse proxy on the Spark; the public edge is a separate Caddy + Let’s Encrypt on the Floki VPS, no tunnel upstream since 2026-05-24)

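After= only orders units; it does not start them. If the unit files also declare Wants= or Requires= on their upstream units, a single sudo systemctl start dispatcher.service will start the chain up to and including the dispatcher, and any unit listed after the dispatcher still has to be started separately. If you are starting from a known-good state and want the explicit visibility, start each unit individually and wait for the active (running) status before moving to the next.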
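Whether one systemctl start really pulls in the backends depends on Wants= or Requires=, since After= alone only affects ordering. The sketch below reads the properties systemd has loaded for a unit and checks whether a given dependency would be started with it. The function names are mine; the unit names are the ones from the list above.

```shell
# show_unit_deps UNIT: prints the After=, Wants= and Requires= properties
# systemd has loaded for UNIT, one "Property=value ..." line each.
show_unit_deps() {
    systemctl show "$1" --property=After --property=Wants --property=Requires
}

# pulls_in UNIT DEP: succeeds when DEP appears in UNIT's Wants= or Requires=,
# meaning that starting UNIT also queues a start of DEP. A unit that appears
# only in After= is ordered but not started.
pulls_in() {
    show_unit_deps "$1" |
        grep -E '^(Wants|Requires)=' |
        tr '= ' '\n\n' |
        grep -Fxq "$2"
}
```

For example, pulls_in dispatcher.service vllm-qwen36.service should succeed before you rely on the single-command start. If it fails, the dispatcher will come up alone and log connection-refused errors.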

For each unit, the health check is:

systemctl is-active <unit> returns active
journalctl -u <unit> -n 20 shows no error-priority lines in the last 20 entries
The unit’s own healthcheck (Prometheus scrape, HTTP /health, or equivalent) returns success

Skipping a health check between units is the most common operator mistake. The downstream unit may start without complaining because its dependency check is shallow, but it will fail later under load because the upstream service was not actually healthy. Verify each step.
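The start-then-verify discipline can be sketched as a loop that refuses to continue past an unhealthy unit. Assumptions: it runs as root on the Spark, the health URLs are supplied by you (the document does not list them), and the journal check only looks at error-priority entries logged since the start attempt rather than the last 20 lines.

```shell
# start_and_verify UNIT [HEALTH_URL]
# Starts UNIT, then applies the three checks: is-active, no err-or-worse
# journal entries since the start attempt, and an optional HTTP health probe.
start_and_verify() {
    unit=$1
    url=${2:-}
    since=$(date '+%Y-%m-%d %H:%M:%S')
    systemctl start "$unit" || return 1
    [ "$(systemctl is-active "$unit")" = "active" ] || return 1
    # -q suppresses journalctl's informational lines, so any output is a real entry
    if journalctl -u "$unit" -p err --since "$since" --no-pager -q | grep -q .; then
        return 1
    fi
    if [ -n "$url" ]; then
        # retry a few times while a backend is still loading weights
        curl -fsS --max-time 5 --retry 5 --retry-delay 5 --retry-connrefused \
            -o /dev/null "$url" || return 1
    fi
    echo "$unit healthy"
}

# start_chain: reads "unit [health-url]" lines on stdin, in order,
# and stops at the first unit that fails a check.
start_chain() {
    while read -r unit url; do
        [ -n "$unit" ] || continue
        start_and_verify "$unit" "$url" </dev/null ||
            { echo "STOP: $unit not healthy" >&2; return 1; }
    done
}
```

Feed start_chain a small text file with one unit per line in the order listed above, adding a health URL where the service exposes one. Stopping at the first failure is the point: it keeps a half-loaded backend from hiding behind a dispatcher that started without complaint.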

For the per-service patterns, see systemd Patterns for Self-Hosted AI Services. For the dispatcher’s role between two co-resident models, see Mistral vs Qwen 3.6 vs GLM-5 on a Single DGX Spark.

Minute 20-30: smoke-test and re-attach

The Spark is up, the inference engines are running, the dispatcher is routing. The remaining ten minutes verify that the system is actually serving traffic.

Smoke-test the primary inference path. From the SSH session on the Spark (the URL targets localhost, so from the management host substitute the Spark’s address):

curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3.6", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 32}'

The response should arrive within two to three seconds for the cold-start case and produce a coherent assistant message. If the response times out, the inference engine is not actually serving traffic; go back to the previous step and re-check the unit health.
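A scripted version of the same request can enforce the timeout and check that the body really contains an assistant message. This is a sketch: the ten-second limit is my assumption, chosen to sit comfortably above the two to three seconds expected here, and the grep-based check is coarse compared to a real JSON parser.

```shell
# has_assistant_reply: reads a chat-completions response body on stdin and
# succeeds when it contains a "choices" key and an assistant-role message.
has_assistant_reply() {
    body=$(cat)
    printf '%s' "$body" | grep -q '"choices"' &&
        printf '%s' "$body" | grep -Eq '"role"[[:space:]]*:[[:space:]]*"assistant"'
}

# smoke_test BASE_URL MODEL
# Sends the same "hello" request as above. --fail turns HTTP errors into a
# non-zero exit, and --max-time bounds the whole request.
smoke_test() {
    curl -sS --fail --max-time 10 "$1/v1/chat/completions" \
        -H 'Content-Type: application/json' \
        -d "{\"model\": \"$2\", \"messages\": [{\"role\": \"user\", \"content\": \"hello\"}], \"max_tokens\": 32}" |
        has_assistant_reply
}
```

smoke_test http://localhost:8000 qwen3.6 exits zero only when a reply came back in time. It does not judge whether the reply is coherent, so read the raw output at least once per recovery.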

Re-attach the dashboard. The sovgrid dashboard at services-sovereign-dashboard depends on Prometheus scraping the Spark; once the Prometheus exporter is running and the dashboard’s queries succeed, the operator’s overview comes back online. (See Services: Sovereign Dashboard for the dashboard architecture.)

Verify the MCP server is reachable. From an external network, hit https://mcp.sovgrid.org/self-hosted-ai/health (or the local equivalent on Tailscale). The MCP server is the public-facing integration surface; if it is not responding, customers’ agents will start failing. The MCP server now runs as a container on the Floki VPS, served by Caddy there directly; the Cloudflare Tunnel that originally linked Floki to the Spark for this path was retired on 2026-05-24. That means the MCP surface survives a Spark power event entirely, since it no longer depends on the Spark being up. The check confirms the Floki edge is healthy independent of the Spark’s own recovery.

Post the “we are back” status to whichever channel your customers expect. For sovgrid that is the Nostr account and the RSS feed; for an enterprise customer it might be a Slack channel or an internal status page. The post is part of the contract with the customer, not just a courtesy.

What goes wrong without the runbook

Three failure modes I have hit when I tried to recover without the runbook in front of me.

Services started out of order. Starting the dispatcher before the inference engines causes the dispatcher to log connection-refused errors that surface in the operator dashboard, and the operator (me) wastes ten minutes investigating a non-issue. The dependency declarations in the unit files prevent this; remembering to use systemctl start dispatcher.service rather than starting each unit by hand prevents it again.

Page cache not flushed. This is the single most expensive mistake. Skipping drop_caches=3 before starting the inference engine causes an OOM at 95 GB on a 70 GB model. The OOM triggers the systemd restart policy, which retries, which OOMs again, which retries, which trips the rate limiter, at which point the unit goes into the failed state and the operator has to manually intervene. Twenty minutes lost to a one-line command not run.
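Once the unit has tripped the rate limiter, the manual intervention is short. A minimal sketch, assuming root on the Spark; DROP_CACHES_FILE is a hypothetical override I added so the function can be exercised without touching the real kernel interface.

```shell
# recover_failed_unit UNIT
# If UNIT is in the failed state: flush the page cache, clear the failed state
# (which also resets systemd's start rate-limit counter), and start it again.
recover_failed_unit() {
    unit=$1
    if [ "$(systemctl is-failed "$unit")" != "failed" ]; then
        echo "$unit is not in the failed state, nothing to do"
        return 0
    fi
    sync
    # the step that was skipped the first time
    echo 3 > "${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}"
    systemctl reset-failed "$unit"
    systemctl start "$unit"
}
```

Without reset-failed, a start request can be refused while the rate limit is still in effect, which looks like a second fault even though the cause is already fixed.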

Tailscale identity stale. If the Spark was offline long enough that the Tailscale node key needs re-handshake, the SSH session works on the local network but the remote management surface does not. Solution: sudo tailscale up --reset on the Spark, then verify the node is visible in the Tailscale admin. Five minutes if you know about it, twenty if you do not. (See Tailscale vs Headscale for Multi-Box Sovereign Stacks for the broader networking-layer reasoning.)
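The stale-identity case can be spotted from the Spark itself before you go looking in the admin console. This sketch reads the BackendState field from tailscale status --json with sed, to avoid assuming jq is installed; the function names are illustrative.

```shell
# tailscale_backend_state: prints the BackendState value reported by the
# local tailscaled, for example Running or NeedsLogin.
tailscale_backend_state() {
    tailscale status --json |
        sed -n 's/.*"BackendState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
        head -n 1
}

# tailscale_ok: succeeds only when the backend reports Running.
tailscale_ok() {
    [ "$(tailscale_backend_state)" = "Running" ]
}
```

Anything other than Running means the mesh side needs attention even though LAN SSH works, which is exactly the split symptom described above.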

The runbook lives outside the Spark

The runbook copy that matters is the one that lives somewhere the Spark cannot take down. A printed sheet in the binder next to the workstation, a USB stick taped to the chassis, an entry in a notes-on-the-management-host directory: any of these is fine. A runbook that only lives on the Spark is useless when the Spark is the thing that needs recovering.

Refresh the runbook quarterly. The procedure changes as the stack evolves; a runbook that is six months out of date will lead you down a path that no longer applies. (See Strategy: Backup and Disaster Recovery for the broader disaster-recovery posture.)

Book a Stack Audit

The runbook above is the version that works for my stack. Your stack will be similar in structure but different in detail. A Stack Audit produces a runbook tailored to your specific deployment, including the systemd unit files, the dependency declarations, and the smoke-test commands.

To discuss a Stack Audit, reach out via the contact links in the footer. The cost of a tailored runbook is small compared to the hours saved on the first real recovery.
