systemd Patterns for Self-Hosted AI Services

May 20, 2026 7 min read

Update (2026-06-19). Any qwen3.6-prismaquant in the unit examples below predates the 2026-06-11 production switch to AutoRound int4-mixed (69.2 tok/s, PrismaQuant retired). The served name and port are unchanged, so the systemd units are identical; only the quant on disk changed. Live stack: /stack/.

Six unit-file patterns. The patterns themselves are documented elsewhere; the discipline of applying them consistently to every long-running service in a multi-service AI stack is the load-bearing operational habit.

Quick Take

Pattern 1: pre-flight commands in ExecStartPre= for things that must happen before the service starts (page-cache flush, scratch-directory creation, weight verification).

Pattern 2: explicit dependencies in After= and Wants=, not implicit ordering by name.

Pattern 3: bounded restart with Restart=on-failure, RestartSec=, and StartLimitBurst= to prevent infinite restart loops.

Pattern 4: resource ceilings via MemoryHigh=, MemoryMax=, and CPUQuota= to keep one service from starving another.

Pattern 5: structured environment via EnvironmentFile= instead of hard-coding flags in ExecStart=.

Pattern 6: graceful shutdown with TimeoutStopSec= and a SIGTERM handler that flushes state before exiting.

Pattern 1: pre-flight commands in `ExecStartPre=`

The single most common operational bug on a fresh restart is “the service started but the environment was not in the state the service expected.” The fix is a pre-flight command that gets the environment right before the service starts.

For the DGX Spark inference services, the canonical pre-flight is the page-cache flush:

[Service]
ExecStartPre=/bin/sh -c 'echo 3 > /proc/sys/vm/drop_caches'
ExecStart=/usr/local/bin/vllm serve qwen3.6-prismaquant --port 8000

(On the sovgrid stack the inference containers are Docker-managed via switch.sh, but the unit-file pattern applies identically to any service that wraps a long-running process.)

Without the pre-flight, the page-cache hijack failure mode (see Fixes: SGLang Restart OOM Fix) triggers an OOM at 95 GB on a 70 GB model load. With the pre-flight, the OOM does not happen.

Other useful pre-flight commands: SHA-verifying the model files before loading, ensuring the scratch directory exists and is writable, confirming the Tailscale identity is current, pre-creating Prometheus textfile-collector outputs.

The rule: any state your service implicitly assumes is correctly initialized should be explicitly initialized by an ExecStartPre=. If you cannot list the assumptions, the service has hidden coupling that will break on the next reboot.
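Two of the pre-flight checks listed above, checksum verification and the scratch directory, could be gathered into one helper along these lines. This is a minimal sketch, not the script used on the sovgrid stack: the function names, the manifest format (the output of `sha256sum`, with paths relative to the model directory), and every path are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical pre-flight helper; all names and paths are illustrative.

# Succeed only if every file listed in the manifest matches its SHA-256.
# The manifest path should be absolute, because the check runs inside model_dir.
verify_weights() {
    model_dir="$1"
    manifest="$2"
    (cd "$model_dir" && sha256sum --quiet -c "$manifest")
}

# Create the scratch directory if it is missing and confirm it is writable.
ensure_scratch() {
    mkdir -p "$1" && [ -w "$1" ]
}

# Run both checks; a non-zero return makes systemd abort the start.
preflight() {
    verify_weights "$1" "$2" || { echo "preflight: checksum mismatch" >&2; return 1; }
    ensure_scratch "$3" || { echo "preflight: scratch dir not writable" >&2; return 1; }
}

# Installed as a script, the last line would be:  preflight "$@"
```

Because a failing ExecStartPre= stops the unit from starting, a corrupted weight file surfaces as a failed start with a clear log line instead of a crash halfway through model load.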

Pattern 2: explicit dependencies in `After=` and `Wants=`

systemd’s parallel startup is fast and is exactly the wrong behavior for a stack where service B depends on service A. The fix is to declare the dependencies explicitly.

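[Unit]
After=network-online.target tailscale.service prometheus-node-exporter.service
Wants=network-online.target tailscale.service
Requires=prometheus-node-exporter.service

After= only orders startup; it does not pull the dependency in. Wants= and Requires= pull the dependency in but do not order it, so a unit listed in Requires= without a matching After= starts in parallel with this one. That is why every dependency above appears in both places. The difference between Wants= and Requires= is whether a dependency failure should propagate to this service: use Requires= for hard dependencies (the service cannot function without it), Wants= for soft dependencies (the service can function but works better with it).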

The canonical sovgrid dependency chain is in Power Failure Recovery on a DGX Spark: network-online.target → tailscale.service → prometheus-node-exporter.service → vllm-qwen36.service → sglang-mistral.service → dispatcher.service → mcp-server.service → caddy.service.

Relying on service names for ordering is a footgun, because systemd does not order units by name at all. Two services with similar names start in whatever order the parallel scheduler reaches them, and that order can differ from one boot to the next.
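Since Wants= and Requires= do not imply ordering, one cheap audit is to look for units that are pulled in but never ordered. The sketch below is a hypothetical lint, not a systemd tool: `unordered_deps` is a name invented here, and it reads a single unit file only, ignoring drop-ins and line continuations.

```shell
#!/bin/sh
# Print each unit named in Requires= or Wants= that is missing from After=.
unordered_deps() {
    awk -F= '
        /^(Requires|Wants)=/ { n = split($2, u, " "); for (i = 1; i <= n; i++) dep[u[i]] = 1 }
        /^After=/            { n = split($2, u, " "); for (i = 1; i <= n; i++) ordered[u[i]] = 1 }
        END { for (d in dep) if (!(d in ordered)) print d }
    ' "$1" | sort
}

# Example (path is illustrative):
#   unordered_deps /etc/systemd/system/vllm-qwen36.service
```

An empty result means every declared dependency is also ordered. A non-empty result is not always a bug, since an unordered Wants= is sometimes intentional, but each entry is worth a conscious decision.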

Pattern 3: bounded restart with `Restart=on-failure`

A service that crashes and restarts immediately, repeatedly, without backoff, will exhaust resources and confuse the operator’s monitoring. The fix is bounded restart with backoff.

[Unit]
# Start-rate limiting is configured in [Unit], not [Service].
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=10

This says: restart on failure, wait 10 seconds before each restart, allow up to 5 starts in any 300-second window, then enter the failed state and wait for operator intervention. The bounded burst stops a tight crash loop. It does not catch a slow one: a service that crashes every five minutes never reaches five starts inside the window and restarts indefinitely, so that case needs monitoring of its own (for example on the unit's NRestarts counter).

For long-running inference services, Restart=on-failure is correct rather than Restart=always. Neither setting restarts a unit that the operator stopped with systemctl stop. The difference is a clean exit that the process initiates itself (exit code 0, or termination by SIGTERM): on-failure leaves the service stopped, while always brings it straight back, which defeats any shutdown path built into the service.
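To see which crash patterns a start limit catches, a rough model helps. The sketch below assumes a unit that crashes at a fixed period (RestartSec= plus however long it runs before dying) and simplifies systemd's rate limiter to one comparison; `trips_start_limit` is a hypothetical helper, not a systemd command.

```shell
#!/bin/sh
# Succeed (exit 0) if a unit that starts every PERIOD seconds would exceed
# StartLimitBurst=BURST inside StartLimitIntervalSec=INTERVAL.
trips_start_limit() {
    period="$1"
    burst="$2"
    interval="$3"
    # Start number burst+1 happens burst*period seconds after the first start.
    # If that is still inside the window, systemd refuses it.
    [ $(( burst * period )) -lt "$interval" ]
}

# Instant crash with RestartSec=10: the sixth start lands at about 50 s.
#   trips_start_limit 10 5 300    # succeeds: the unit ends up in failed state
# Crash every five minutes:
#   trips_start_limit 300 5 300   # fails: the limit never triggers
```

The second case is the reason to alert on restart counts separately; the start limit only bounds fast loops.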

Pattern 4: resource ceilings via `MemoryHigh=` and `CPUQuota=`

A multi-service stack on a single host can produce noisy-neighbor problems. The inference service can consume so much memory that the Prometheus exporter starves; the dispatcher can spin so hard on CPU that the dashboard cannot scrape metrics. The fix is per-service resource ceilings.

[Service]
MemoryHigh=96G
MemoryMax=100G
CPUQuota=600%

MemoryHigh= is a soft threshold (the kernel starts reclaiming memory at this point); MemoryMax= is a hard threshold (OOM kill at this point). The two together let you tune for “use up to 96 GB usually, kill the service if it exceeds 100 GB” behavior.

For inference services that are intentionally memory-hungry, the ceilings need to be large but not unlimited. Setting MemoryMax= at 90 percent of physical memory keeps the rest of the system functioning even if the inference service runs away.
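The 90 percent figure can be computed from the host instead of hard-coded. The following is a small sketch under stated assumptions: `mem_ceiling_bytes` is a hypothetical helper, and the unit name in the usage comment is taken from the dependency chain in this article rather than from a published unit file.

```shell
#!/bin/sh
# Print PCT percent of MemTotal in bytes.
# Reads /proc/meminfo unless another file is given as the second argument.
mem_ceiling_bytes() {
    pct="$1"
    meminfo="${2:-/proc/meminfo}"
    # MemTotal is reported in KiB (the file labels the unit "kB").
    kib=$(awk '/^MemTotal:/ { print $2 }' "$meminfo")
    [ -n "$kib" ] || return 1
    echo $(( kib * 1024 * pct / 100 ))
}

# Possible use (needs root):
#   systemctl set-property vllm-qwen36.service MemoryMax="$(mem_ceiling_bytes 90)"
```

Deriving the ceiling this way keeps the same unit file usable on hosts with different amounts of memory.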

CPUQuota= is in units of “percent of one CPU.” 600% means six full CPU cores. The Spark has many cores; the inference path uses some, the dispatcher uses some, the MCP server uses some. Quotas keep them from contending unboundedly.

Pattern 5: structured environment via `EnvironmentFile=`

Hard-coding environment variables into ExecStart= is a maintenance pain. Use EnvironmentFile= to load variables from a separate file that is easier to edit and easier to share across multiple unit files.

[Service]
EnvironmentFile=/etc/sovgrid/inference.env
ExecStart=/usr/local/bin/vllm serve ${MODEL_PATH} --port ${PORT}

With /etc/sovgrid/inference.env:

MODEL_PATH=/data/models/qwen3.6-prismaquant
PORT=8000
VLLM_FLASHINFER_MOE_BACKEND=latency
HF_HOME=/data/hf-cache

The benefits: configuration changes do not require editing the unit file (which would require a daemon-reload), only a restart of the service so the new values are read; the file is auditable as a configuration artifact; multiple unit files can share the same environment file if they need the same baseline.

The downside: anything in an EnvironmentFile= is readable by whoever can read the file, and once loaded it sits in the process environment, where child processes inherit it. For real secrets, use LoadCredential= (systemd’s credential mechanism) or a separate secret-management layer.
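On the service side, a credential passed with LoadCredential= shows up as a file under the directory named by $CREDENTIALS_DIRECTORY. A minimal sketch of reading one from a wrapper script follows; the credential name `hf-token`, its source path, and the helper `read_credential` are all assumptions made for illustration.

```shell
#!/bin/sh
# Unit side (names are illustrative):
#   [Service]
#   LoadCredential=hf-token:/etc/sovgrid/secrets/hf-token

# Print the named credential, or fail if systemd did not provide it.
read_credential() {
    name="$1"
    dir="${CREDENTIALS_DIRECTORY:-}"
    [ -n "$dir" ] && [ -r "$dir/$name" ] || return 1
    cat "$dir/$name"
}

# Possible use inside the wrapper:
#   token=$(read_credential hf-token) || exit 1
```

Failing loudly when the credential is absent avoids a service that starts without its secret and only misbehaves later.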

Pattern 6: graceful shutdown with `TimeoutStopSec=`

Inference services with active state (KV cache, channel state, in-flight requests) should be given time to drain on shutdown. The default TimeoutStopSec=90 is sometimes too short for a service with a deep KV cache.

[Service]
TimeoutStopSec=180
KillSignal=SIGTERM
ExecStop=/usr/local/bin/sovgrid-graceful-shutdown.sh

KillSignal=SIGTERM is the default but worth declaring explicitly. The service should handle SIGTERM by closing accept-sockets, finishing in-flight requests, flushing any persistent state, and exiting cleanly. systemd then waits up to TimeoutStopSec= for the service to exit before sending SIGKILL.
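The SIGTERM contract described above can be sketched in a few lines of shell. This is an illustration of the pattern only: `run_service` is a hypothetical wrapper, the `sleep` stands in for serving a batch of requests, and writing a marker file stands in for flushing real state.

```shell
#!/bin/sh
# Hypothetical service loop: work until SIGTERM arrives, then flush and exit 0.
stopping=0

run_service() {
    state_file="$1"
    # The handler only sets a flag; the loop decides when it is safe to stop.
    trap 'stopping=1' TERM
    while [ "$stopping" -eq 0 ]; do
        sleep 0.1 &          # stand-in for one unit of in-flight work
        wait $! || true      # wait returns early when the signal arrives
    done
    echo "flushed" > "$state_file"   # stand-in for persisting state
    return 0
}

# Installed as a script, the last line would be:  run_service /path/to/state
```

Exiting with status 0 after the flush is what lets systemd record the stop as clean, so Restart=on-failure does not bring the service back.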

ExecStop= gives a hook for explicit shutdown logic. systemd runs it first and sends KillSignal= only to processes that are still alive once it returns. For the Lightning node in [Setup: Alby Hub ARM64 Self-Hosted Lightning](/blog/setup-alby-hub-arm64-self-hosted-lightning/), this is where channel state is flushed to disk before the node exits. For an inference service, this is where pending tokens are flushed and the KV cache is dumped if you want to warm-restart later.
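An ExecStop= script usually boils down to waiting for a drain condition with a deadline. The sketch below shows one way to write that loop. It is not the content of sovgrid-graceful-shutdown.sh, which this article does not publish; `wait_for_drain` and the marker file in the usage comment are hypothetical.

```shell
#!/bin/sh
# Poll a command once per second until it succeeds.
# Give up and return 1 after TIMEOUT seconds.
wait_for_drain() {
    timeout="$1"
    shift
    waited=0
    until "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}

# Possible use, assuming the service removes a marker file when idle:
#   wait_for_drain 150 test ! -e /run/sovgrid/inflight
```

Keeping the script's own deadline below TimeoutStopSec= leaves systemd enough time to deliver SIGTERM and wait for a clean exit before it falls back to SIGKILL.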

Where this fits

For the broader operational context, see The Sovereign AI Stack in 2026. For the recovery procedure these patterns enable, see Power Failure Recovery on a DGX Spark. For the broader systemd documentation, the manual pages are authoritative.

One pattern deliberately absent: socket activation

Socket activation is a real systemd capability and useful for short-lived RPC services that should spin up on demand. It is intentionally absent from the patterns above because an inference container with a sixty-second warmup and a hot KV cache is the opposite of what socket activation is good at. The cost of cold-starting an inference service per request dwarfs the cost of keeping it running, and the unified-memory pool on a GB10 means a sleeping engine still holds its weights in RAM until explicitly unloaded. The general rule: socket activation pays off when startup is cheap and idle resource cost is high. Inference services on the GB10 invert both signals, so the unit files in this article run in long-lived mode with Restart=on-failure (Pattern 3) and explicit cleanup hooks instead.

Follow the unit-file repository

A future article will publish the actual unit files in use on the sovgrid stack, with the per-line annotations explaining the choices. Follow via RSS or Nostr (links in footer) to catch it.
