Backing Up 119B Parameters Without Going Bankrupt on Storage
You do not back up the weights. You back up the model identifier, the configuration, the customer data, and the runbook. The weights are reproducible from upstream on restore. The data and the runbook are not.
This is the cheapest correct answer for a one-operator sovereign-AI stack in 2026. The alternative (full nightly snapshots of the model weights at 60 GB per copy) consumes terabytes of off-site storage per month for zero marginal recovery benefit, because the weights were already public on Hugging Face when you downloaded them and remain public when you need them back.
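The "terabytes per month" figure follows from simple multiplication. A minimal sketch of that arithmetic, assuming 30 nightly copies are retained with no deduplication (the retention count is an assumption, not a number from this post):

```shell
#!/bin/bash
set -euo pipefail

# Storage held by COPIES retained snapshots of SIZE_GB each, in GB.
retained_gb() {
  local size_gb=$1 copies=$2
  echo $(( size_gb * copies ))
}

# Assumption: 30 nightly copies kept (one month), no deduplication.
retained_gb 60 30   # prints 1800, i.e. about 1.8 TB of weights
```

Swap in your own snapshot size and retention count to see where your operation lands.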
Quick Take
- Back up: configuration files, systemd unit files, prompts, dispatcher logic, customer data, RAG indexes, model identifier strings, fine-tuning artifacts, the runbook, the dashboard config, the Lightning node state.
- Do not back up: model weights, container images, Python virtualenvs, anything that is reproducible from a public registry.
- The exception: if you have fine-tuned a model and the fine-tuned weights are unique to you, those weights are not reproducible and they must be backed up.
- Storage budget at this discipline: the back-up set is typically <20 GB even for an active operation. Off-site storage at this volume is essentially free.
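- Restore time: about thirty minutes for configuration and data from a known-good backup, plus the re-download of the weights (hours, depending on bandwidth) if the upstream registry is reachable; longer if a model has been deprecated upstream and you need to find a mirror.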
What is actually expensive to lose
The expensive losses are not the model weights. The expensive losses are the artifacts of operation that were never published anywhere else.
The configuration that says “Qwen 3.6 PrismaQuant 4.75bit, vLLM with VLLM_FLASHINFER_MOE_BACKEND=latency, DFlash k=3, gpu-memory-utilization 0.5.” That string of decisions took weeks to land on, encodes the disasters that produced each flag, and would take weeks to re-derive from scratch. The configuration file is a few kilobytes. Lose it, and you re-discover the decisions one by one.
The dispatcher logic that routes code calls to Qwen and creative calls to Mistral. The regex classifier, the routing rules, the fallback behavior on the case where the primary model is down. A few hundred lines of Python. Lose it, and the two-model stack stops working until you rewrite the dispatcher.
The customer data. The RAG indexes built from a customer’s document corpus, the embedding cache, the inference logs from past engagements, the chat histories that customers paid for. None of this is reproducible. All of this is small (typically gigabytes, not terabytes). Lose it, and the customer relationship is damaged.
The Lightning node state. The channel database, the funding transaction records, the routing fee history. Lose this without a recent backup, and the Lightning node can lose funds when channels force-close. (See [Setup: Alby Hub ARM64 Self-Hosted Lightning](/blog/setup-alby-hub-arm64-self-hosted-lightning/) for the Lightning-specific backup discipline, which is significantly stricter than the general-purpose backup discipline.)
The runbook. The institutional memory of what to do when each of the documented failure modes recurs. (See Five DGX Spark Disasters I Survived for the postmortems that built the current runbook.) Lose it, and the next disaster takes hours instead of thirty minutes.
The total size of all of this for a working sovgrid-class operation is typically under 20 GB. Most of it is text. Compressed and encrypted, the off-site backup is small.
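The 20 GB figure is worth checking rather than assuming. A minimal sketch of a budget check, assuming GNU du (for --exclude); the function names are hypothetical and the patterns mirror the weight and cache exclusions used in this post:

```shell
#!/bin/bash
set -euo pipefail

# Print the size in KiB of a directory tree, skipping weight files and caches.
# Requires GNU du (for --exclude).
backup_set_kib() {
  du -sk --exclude='*.gguf' --exclude='*.weight' --exclude='__pycache__' "$1" | cut -f1
}

# Succeed only if the tree fits within the budget, given in GiB.
within_budget() {
  local dir=$1 budget_gib=$2 used_kib
  used_kib=$(backup_set_kib "$dir")
  [ "$used_kib" -le $(( budget_gib * 1024 * 1024 )) ]
}
```

A call such as within_budget /var/lib/sovgrid 20 returns non-zero when that tree alone has outgrown the budget, which is usually a sign that weights or images have crept in.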
What is actually cheap to lose
Model weights, in three categories.
Public open-weights models are cheap to lose because they are public. Re-download from Hugging Face on restore. The download takes hours; the storage cost of keeping a local backup is monthly. The trade is straightforward: pay the one-time hours on restore (rare) rather than the monthly storage (always).
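Re-downloading only works if the backup records exactly which models to fetch. One way to sketch that, assuming a hypothetical manifest file with one "REPO_ID REVISION" pair per line and the huggingface-cli download command from the huggingface_hub package. The function only prints the commands, so they can be reviewed before anything is run; recording a revision alongside the identifier is an addition of this sketch, not something the post prescribes:

```shell
#!/bin/bash
set -euo pipefail

# Read a manifest with one "REPO_ID REVISION" pair per line and print the
# download command for each entry. Blank lines and '#' comments are skipped.
print_restore_commands() {
  local manifest=$1 repo rev
  while read -r repo rev || [ -n "$repo" ]; do
    [ -z "$repo" ] && continue
    case $repo in '#'*) continue ;; esac
    printf 'huggingface-cli download %s --revision %s\n' "$repo" "$rev"
  done < "$manifest"
}
```

The manifest is a few hundred bytes and belongs in the backup set next to the configuration it describes.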
Container images are cheap to lose because they are reproducible from Dockerfile plus the registry that hosts the base layers. Back up the Dockerfile, not the image. The image rebuild takes minutes on restore.
Python virtualenvs are cheap to lose because they are reproducible from requirements.txt plus pip install. Back up the requirements file, not the virtualenv. Recreate on restore.
The exception: fine-tuned weights. If you have spent compute time producing a fine-tune of a base model, those weights are unique to you and are not reproducible from any public source. Back them up. With a LoRA-style approach the size is a small fraction of the base model (an adapter is typically megabytes, not gigabytes); a full fine-tune is as large as the base model and still belongs in the backup.
The three-tier backup pattern
Tier 1: hot working state on the local NVMe. This is the live filesystem. It is not a backup; it is the working state. Disk failure here is the disaster the other two tiers exist to recover from.
Tier 2: cold snapshots to removable media or an on-LAN NAS. A USB drive or NAS on the same network as the Spark, holding a rolling set of snapshots of the back-up set. Frequency is an operator decision: nightly automated is one model; manual-when-you-plug-in-the-stick is another. The manual approach is not laziness — it is a deliberate sovereignty choice that keeps backup timing under operator control and off any automated schedule. Either model works. What matters is that the backup happens and is verified.
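A rolling set implies pruning. A minimal sketch, assuming the snapshots are directories named by date (YYYY-MM-DD) under a root that is mounted or local where the function runs; the function name and the keep count are hypothetical:

```shell
#!/bin/bash
set -euo pipefail

# Keep the newest KEEP snapshot directories under ROOT and delete the rest.
# Snapshot directories are named YYYY-MM-DD (the output of `date +%F`),
# so lexical sort order equals chronological order.
prune_snapshots() {
  local root=$1 keep=$2 i
  local -a snaps
  mapfile -t snaps < <(find "$root" -mindepth 1 -maxdepth 1 -type d \
    -name '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' | sort)
  local excess=$(( ${#snaps[@]} - keep ))
  for (( i = 0; i < excess; i++ )); do
    rm -rf -- "${snaps[i]}"
  done
}
```

Directories that do not match the date pattern are left alone, so a stray notes folder on the same drive survives the prune.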
Tier 3: encrypted off-site backup to a cloud storage provider. The same back-up set, encrypted client-side, pushed off-site on whatever cadence fits the operation. The cloud provider holds the bits but cannot read them. Choose a provider whose pricing is volume-based rather than per-API-call, because backup workloads do many small operations. At a 20 GB working set, the monthly bill is in the low single-digit euros. (For the rented-dimension honesty here: the cloud storage provider is a named dependency. The encryption is client-side, the key never leaves the local machine, and the provider would see only encrypted bytes if compromised.)
The three tiers together produce a recovery posture that survives a single-hardware failure (recover from Tier 2), a site-level disaster like a fire (recover from Tier 3), and a hostile-action scenario like ransomware (recover from a Tier 3 snapshot from before the compromise). Each tier covers a failure mode the previous tier does not.
The actual backup script
The pattern in shell-script terms:
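#!/bin/bash
set -euo pipefail
# Paths to back up, and patterns to leave out (weights, caches, images).
# Arrays keep each path and pattern as one word when expanded.
BACKUP_SET=(/etc/sovgrid /var/lib/sovgrid /home/operator/projects /var/lib/lnd)
EXCLUDE_PATTERNS=(--exclude='*.gguf' --exclude='*/__pycache__' --exclude='*.weight' --exclude='container-images')
STAMP=$(date +%F)   # computed once so both tiers carry the same date
# Tier 2: snapshot to NAS. -R keeps the full source paths, so /etc/sovgrid
# and /var/lib/sovgrid do not merge into a single "sovgrid" directory.
rsync -aHR --delete "${EXCLUDE_PATTERNS[@]}" "${BACKUP_SET[@]}" "nas.local:/backups/sovgrid/${STAMP}/"
# Tier 3: encrypt and push to off-site. The excludes come before the paths
# because GNU tar treats --exclude as positional.
tar czf - "${EXCLUDE_PATTERNS[@]}" "${BACKUP_SET[@]}" \
| age -r "$(cat ~/.config/age-recipient.txt)" \
| rclone rcat "remote:sovgrid-backups/${STAMP}.tar.gz.age"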
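The EXCLUDE_PATTERNS line is load-bearing. Excluding *.gguf, *.weight, __pycache__, and container-images keeps the model weights, the byte-compiled Python files, and the Docker images out of the backup. The remaining set is configuration, data, and code. One interaction to watch: fine-tuned weights stored as .gguf or .weight files under these paths match the same patterns and are skipped, so they need their own backup step.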
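Whether the excludes actually took effect can be checked against the archive itself. A minimal sketch, assuming an unencrypted .tar.gz (or a decrypted stream on stdin, passed as "-"); the function name is hypothetical:

```shell
#!/bin/bash
set -euo pipefail

# List archive members that look like model weights or container images.
# Prints nothing and returns 0 when the archive is clean; prints the
# offending members and returns 1 otherwise.
find_leaked_weights() {
  local archive=$1 listing leaks
  listing=$(tar tzf "$archive")
  leaks=$(grep -E '\.(gguf|weight)$|(^|/)container-images(/|$)' <<<"$listing" || true)
  if [ -n "$leaks" ]; then
    printf '%s\n' "$leaks"
    return 1
  fi
}
```

For the encrypted Tier 3 archive, the same check can run on a decrypted stream, for example age -d -i KEYFILE backup.tar.gz.age | find_leaked_weights - where KEYFILE is wherever your age identity lives.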
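The encryption uses age (which is sound, well-audited, and has a small attack surface) with a single recipient key held by the operator. The encrypted bytes go to rclone-supported cloud storage. Because the encryption is client-side, a provider compromise exposes only ciphertext. It does not stop a provider from losing or deleting the data, which is one reason the Tier 2 copy exists. The matching private key has to survive the loss of the machine, so keep a copy of it somewhere other than the backup it protects.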
The set -euo pipefail at the top is non-negotiable. A backup script that exits 0 on partial failure is worse than no backup at all, because the operator believes they have a backup when they do not.
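The pipefail part is the one that matters for the tar | age | rclone pipeline. A small demonstration of the difference, with false standing in for a failing first stage and cat for the stages after it:

```shell
#!/bin/bash

# Print the exit status of a pipeline whose first stage fails.
# $1 is "on" or "off" for pipefail; the subshell keeps the caller's
# shell options untouched.
pipeline_status() {
  (
    set +e
    if [ "$1" = on ]; then set -o pipefail; else set +o pipefail; fi
    false | cat
    echo $?
  )
}

pipeline_status off   # prints 0: the failure of the first stage is hidden
pipeline_status on    # prints 1: the failure reaches the script
```

Without pipefail, the pipeline reports only the status of its last stage, so a tar that dies halfway can still produce an upload that looks successful.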
One caveat on the rsync Tier 2 path: if the NAS disconnects mid-transfer, rsync may leave missing or partial files in the destination with no indication beyond a non-zero exit code. With set -euo pipefail, that non-zero exit stops the script before the Tier 3 encrypt-and-push step runs. Good. But it also means the NAS snapshot for that run is incomplete, and a later run into a new dated directory does not repair it. Re-run the sync into the same dated directory with rsync --checksum, which compares file contents rather than size and modification time, before trusting that snapshot.
A second caveat applies during a model-version migration: if you upgrade from Qwen 3.6 PrismaQuant to a successor model and the configuration schema changes (new keys, renamed flags, dropped env vars), the old configuration in /etc/sovgrid/ may restore cleanly but fail silently at runtime. Pin the configuration to a model version string in the filename, for example vllm-qwen36-prisma475bit.env, so a restore drill will surface the schema mismatch before it becomes a production incident.
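For the incomplete-snapshot caveat, a checksum manifest is an alternative to rsync --checksum that can be checked later without the source machine. This is a different technique from the rsync flag, sketched here with GNU coreutils and findutils; the function names are hypothetical, and the manifest should live outside the tree it describes:

```shell
#!/bin/bash
set -euo pipefail

# Write a SHA-256 manifest of every regular file under DIR to MANIFEST.
# Paths are recorded relative to DIR so the manifest can be checked anywhere.
write_manifest() {
  local dir=$1 manifest=$2
  ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

# Verify a copy of the tree against the manifest (read from stdin).
# Returns non-zero if any listed file is missing or differs.
verify_manifest() {
  local dir=$1 manifest=$2
  ( cd "$dir" && sha256sum --check --quiet ) < "$manifest"
}
```

Writing the manifest on the source before the sync and verifying it against the snapshot afterwards turns "rsync exited 0" into "every file arrived intact".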
How to verify a backup
A backup that has never been restored is not a backup. Verify with a restore drill at least quarterly.
The drill: provision a fresh VM, restore the Tier 3 backup, walk through the runbook, confirm that the inference services start and serve a smoke-test query. The drill takes a few hours. If it fails, the backup is broken, and fixing the backup discipline is now the highest-priority work.
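The file-level part of the drill can be scripted. A minimal sketch, assuming the archive has already been decrypted (for example with age -d -i KEYFILE) and that the paths to check are given relative to the archive root, since GNU tar strips the leading slash when it creates an archive; the function name is hypothetical:

```shell
#!/bin/bash
set -euo pipefail

# Extract ARCHIVE (a decrypted .tar.gz) into a scratch directory and confirm
# that every path listed after it exists in the restored tree.
# Prints the scratch directory on success; returns 1 on the first missing path.
restore_and_check() {
  local archive=$1; shift
  local scratch path
  scratch=$(mktemp -d)
  tar xzf "$archive" -C "$scratch"
  for path in "$@"; do
    if [ ! -e "$scratch/$path" ]; then
      echo "missing after restore: $path" >&2
      return 1
    fi
  done
  echo "$scratch"
}
```

A call such as restore_and_check backup.tar.gz etc/sovgrid var/lib/sovgrid var/lib/lnd covers only file presence. Starting the inference services and running the smoke-test query remain runbook steps.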
The drill also surfaces the small undocumented dependencies that the operator has been carrying in their head. The Tailscale key, the Lightning seed phrase (which lives on a hardware wallet, not in the backup), the operator’s SSH key from the management host. These dependencies should be documented in the runbook; the drill is the test that catches the ones that are not.
Where this fits
For the broader DR posture, see Strategy: Backup and Disaster Recovery. For the power-event recovery procedure, see Power Failure Recovery on a DGX Spark: The 30-Minute Procedure. For the reference architecture that contextualizes the backup discipline, see The Sovereign AI Stack in 2026.
Read the recovery runbook next
The backup discipline above is half the story. The other half is the recovery procedure that uses the backup. Read Power Failure Recovery on a DGX Spark for the operational pair.