My Backup Ran for Six Weeks Without Backing Anything Up
For six weeks (the entire life of this DGX Spark so far), systemctl status sovereign-backup.timer showed green. The journal showed clean exits. No errors, no alerts, no missed schedules on the dashboard. Not a single backup tar was ever written. Four independent bugs lined up to produce a system that looked like it was working and was not. This is the postmortem and the rebuild that replaced it.
Quick Take
- A green systemd timer status reports the schedule, not the job's exit status
- Four silent bugs combined to keep the backup script from ever executing successfully
- The fix was a rewrite, not a patch: new service file, USB repartition, atomic writes, ERR-trap
- Final architecture: Tier A NVMe nightly (14d), Tier B USB rolling (30d), age-encrypted, single recipient key
- Lesson generalises: any service with enabled+active and no end-to-end smoke test is a candidate for the same class of failure
What looked fine
The setup was straightforward on paper. A systemd timer fires sovereign-backup.service nightly at 02:00. The service runs /usr/local/bin/backup.sh, which tars /data/projects plus /data/secrets plus a few other directories, pipes the stream through age --recipient to encrypt it, writes the tarball to /data/backups/, prunes anything older than 14 days, and exits zero.
systemctl list-timers showed the timer scheduled correctly, last-fired stamps moved nightly, the unit stayed active. The dashboard pulled timer status from the same source and rendered it green. I never set up an alert because the timer never reported a failure to alert on.
The first time I needed to restore a deleted file, the backup directory was empty. Six weeks of empty.
Bug 1: ExecStart pointed at a symlink that was never created
The unit file shipped with this:
# /etc/systemd/system/sovereign-backup.service (broken)
[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
The actual script lived at /data/projects/sovereign-backup/backup.sh. The deploy procedure assumed a symlink at /usr/local/bin/backup.sh that pointed at the real script. The symlink was planned and never made. systemd dutifully tried to run /usr/local/bin/backup.sh every night, failed with No such file or directory (which it records on the service as status=203/EXEC), and moved on. The timer’s success state reflected only “the timer fired”, not “the service did anything useful”.
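One way to catch a missing ExecStart target at deploy time is to read the path out of the unit file and test it before enabling the timer. This is a minimal sketch, not part of the original deploy procedure: the function names exec_start_path and check_exec_start are hypothetical, and it only looks at the first ExecStart= line.

```shell
# Print the executable path from the first ExecStart= line of a unit file.
# Strips systemd's optional prefix characters (-, @, :, +, !) before the path.
exec_start_path() (
    line=$(grep -m1 '^ExecStart=' "$1") || exit 1
    cmd=${line#ExecStart=}
    # Remove leading prefix characters one at a time.
    while :; do
        case "$cmd" in
            [-@:+!]*) cmd=${cmd#?} ;;
            *) break ;;
        esac
    done
    # The path is everything up to the first space.
    printf '%s\n' "${cmd%% *}"
)

# Succeed only if that path exists and is executable. test -x follows
# symlinks, so a dangling symlink fails as well.
check_exec_start() (
    path=$(exec_start_path "$1") || { echo "no ExecStart in $1" >&2; exit 1; }
    if [ ! -x "$path" ]; then
        echo "ExecStart target missing or not executable: $path" >&2
        exit 1
    fi
)
```

Run as check_exec_start /etc/systemd/system/sovereign-backup.service, this would have flagged /usr/local/bin/backup.sh on day one instead of week six.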
systemctl status sovereign-backup.timer shows timer health. systemctl status sovereign-backup.service would have shown the failed exits, but the dashboard scraped only the timer.
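The service-side check can be scripted by asking systemd for the result of the last run rather than the timer state. The sketch below assumes the Result and ExecMainStatus properties printed by systemctl show; the function name service_run_ok is hypothetical, and it reads the KEY=VALUE lines from stdin so it can be exercised without systemd.

```shell
# Read `systemctl show` KEY=VALUE lines on stdin. Succeed only if the last
# run reported Result=success and ExecMainStatus=0.
service_run_ok() (
    result="" status=""
    while IFS='=' read -r key value; do
        case "$key" in
            Result) result=$value ;;
            ExecMainStatus) status=$value ;;
        esac
    done
    [ "$result" = "success" ] && [ "$status" = "0" ]
)

# Intended use (assumes the unit name from this post):
#   systemctl show -p Result -p ExecMainStatus sovereign-backup.service \
#       | service_run_ok || echo "last backup run failed" >&2
```

As far as I can tell, a service that has never run also reports success here, so this complements a check that a backup file actually landed and does not replace it.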
Bug 2: age was not installed at all
The backup script’s preflight checked for age:
command -v age >/dev/null || { echo "age not installed" >&2; exit 1; }
The check itself is correct: with age missing from the system, the script exits 1 and the message goes into the journal. But nothing read the journal. The dashboard was never wired up to parse journalctl -u sovereign-backup.service output, so it would not have surfaced the preflight failure. Even after Bug 1 was found and fixed, the script would still have exited at the preflight without ever encrypting a tar.
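The same kind of check can be run at deploy time, for every tool the script depends on, so a missing binary is found by a human and not by the 02:00 run. A minimal sketch follows; the function name require_commands is hypothetical, and the tool list in the usage comment is taken from the rebuilt script later in this post.

```shell
# Report every missing command in one pass instead of stopping at the first.
require_commands() (
    missing=""
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
    done
    if [ -n "$missing" ]; then
        echo "missing commands:$missing" >&2
        exit 1
    fi
)

# Intended use at install time and at the top of the backup script:
#   require_commands tar pigz age logger || exit 1
```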
Bug 3: key paths in the script did not match where the keys lived
The keys had been generated into ~/.age-identity and ~/.age-recipient, which as root resolves to /root/.age-identity and /root/.age-recipient. (age-keygen has no default output file: it prints the identity to stdout unless given -o, so those paths came from the setup steps, not from the tool.) The backup script referenced /data/secrets/age-identity and /data/secrets/age-recipient. The setup documentation said one path; the implementation referenced another. Even if Bugs 1 and 2 had been fixed, the script would have failed when reading the recipient key.
The fix is structural: the script and the setup doc both reference /data/secrets/age-identity and /data/secrets/age-recipient, the keys are migrated there once, and permissions are tightened to mode 600.
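A path and permission mismatch like this is cheap to test for. The sketch below is one possible check, assuming GNU stat (the -c %a format); the function name check_key_file is hypothetical.

```shell
# Succeed only if the file exists, is non-empty, and has exactly the given
# octal mode. Relies on GNU stat; other platforms need a different call.
check_key_file() (
    file=$1 want_mode=$2
    [ -s "$file" ] || { echo "missing or empty: $file" >&2; exit 1; }
    mode=$(stat -c %a "$file") || exit 1
    if [ "$mode" != "$want_mode" ]; then
        echo "$file has mode $mode, expected $want_mode" >&2
        exit 1
    fi
)

# Intended use, with the paths from this post:
#   check_key_file /data/secrets/age-recipient 600 || exit 1
```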
Bug 4: the USB stick was FAT32
The first USB stick I plugged in for Tier B was FAT32, the factory format on most consumer sticks. Compressed Sparky tar comes in around 1.2 GB on a quiet day. Day 1 backup: 1.2 GB, fits. Day 2: another 1.2 GB. At that rate three runs total 3.6 GB, so by the fourth run at the latest the running combined archive crossed the FAT32 file-size limit (4 GiB minus one byte, about 4.29 GB) and tar failed with File too large. Tar exited non-zero, age never got input, no encrypted file landed on the USB.
This bug had a different signature than the other three: it produces a real visible error in the journal, but only after several days of nominally-working runs. It would have been caught by an end-to-end smoke test that wrote a synthetic large file and read it back. There was no such test.
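A smoke test of that kind can be sketched in a few lines: create a file of a chosen size on the destination, read its size back, and clean up. This is an illustration under assumptions, not the test I run today. The name large_file_probe is hypothetical, it relies on coreutils truncate, and I expect (but have not measured here) that on a filesystem with a 4 GiB per-file cap the truncate call itself fails for a 5 GiB request.

```shell
# Create a file of SIZE bytes in DIR, confirm the size reads back, then
# remove it. truncate makes a sparse file where the filesystem allows it.
large_file_probe() (
    dir=$1 size=$2
    probe="$dir/.size-probe.$$"
    trap 'rm -f "$probe"' EXIT
    truncate -s "$size" "$probe" 2>/dev/null || {
        echo "cannot create a $size-byte file in $dir" >&2
        exit 1
    }
    actual=$(wc -c < "$probe")
    [ "$actual" -eq "$size" ]
)

# Intended use: ask for 5 GiB, which is above the FAT32 per-file limit.
#   large_file_probe /mnt/sovereign-usb $((5 * 1024 * 1024 * 1024)) || exit 1
```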
The rebuild
Once it was clear the failure was structural, not patchable, I rewrote the backup system end-to-end. The new shape:
# /etc/systemd/system/sovereign-backup.service (corrected)
[Unit]
Description=Sovereign nightly local backup
After=network-online.target
[Service]
Type=oneshot
ExecStart=/data/projects/sovereign-backup/backup.sh
ProtectSystem=strict
ReadWritePaths=/data/backups /var/log
NoNewPrivileges=true
StandardOutput=journal
StandardError=journal
# /etc/systemd/system/sovereign-backup.timer
[Unit]
Description=Run nightly local backup
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=15min
[Install]
WantedBy=timers.target
The script fix:
# /data/projects/sovereign-backup/backup.sh: atomic write + ERR trap
set -euo pipefail
RECIPIENT_FILE=/data/secrets/age-recipient
DEST_DIR=${1:-/data/backups}
NAME="sovereign-backup-$(date +%Y%m%d-%H%M%S).tar.gz.age"
FINAL_FILE="${DEST_DIR}/${NAME}"
TMP_FILE="${FINAL_FILE}.tmp"
trap 'rm -f "$TMP_FILE"; logger -t sovereign-backup "aborted, tmp removed"' ERR
tar --exclude='node_modules' --exclude='.cache' \
    -cf - /data/projects /data/secrets /data/gitea \
    | pigz -c \
    | age --recipients-file "$RECIPIENT_FILE" --output "$TMP_FILE"
mv "$TMP_FILE" "$FINAL_FILE"
trap - ERR
# Retention prune
find "$DEST_DIR" -name 'sovereign-backup-*.tar.gz.age' -mtime +14 -delete
The atomic temp-file plus ERR-trap pattern means a partial tar never lands at the canonical filename. Either the full encrypted tarball moves into place, or nothing does, plus a journal entry that the dashboard now actually scrapes.
The USB stick, repartitioned
A 256 GB Samsung USB-C stick split into two partitions:
| Partition | Size | Format | Mount |
|---|---|---|---|
| sdb1 | 40 GB | ext4 | /mnt/sovereign-usb (encrypted backups) |
| sdb2 | ~199 GB | exFAT | /mnt/sovereign-usb-media (cross-platform media) |
ext4 for the backup partition: no file-size cap, supports POSIX permissions and atomic rename. exFAT for the media partition: macOS, Windows, Android can read it, no 4 GB limit. The media partition was a side-benefit, not part of the backup story; it just happens to use the same physical stick.
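The filesystem choice can also be enforced by the script instead of remembered by the operator. The sketch below assumes GNU stat, whose -f -c %T format reports FAT as msdos (findmnt and /proc/mounts call it vfat); the function names is_fat_family and check_backup_fs are hypothetical.

```shell
# Classify a filesystem type name as FAT-family (4 GiB per-file cap) or not.
is_fat_family() {
    case "$1" in
        msdos|vfat|fat) return 0 ;;
        *) return 1 ;;
    esac
}

# Refuse a backup destination that sits on a FAT-family filesystem.
check_backup_fs() (
    fstype=$(stat -f -c %T "$1") || exit 1
    if is_fat_family "$fstype"; then
        echo "$1 is on $fstype, which caps files at 4 GiB" >&2
        exit 1
    fi
)

# Intended use before a Tier B run:
#   check_backup_fs /mnt/sovereign-usb || exit 1
```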
The fifth bug, found while testing the rebuild
After fixing the four original bugs and rebuilding, the dashboard’s “Backup to USB” button still failed. The dashboard service had ProtectSystem=strict. The button shelled out to sudo backup-to-usb.sh, which tried to write into /mnt/sovereign-usb. systemd’s hardening blocked the write with a “Read-only file system” error. Adding the USB mountpoints to ReadWritePaths fixed it:
# Dashboard service unit
ReadWritePaths=/data /var/log /var/lib/tor /var/lib/aide \
/mnt/sovereign-usb /mnt/sovereign-usb-media
systemd hardening is a footgun when it interacts with services that shell out to scripts touching paths outside their protected tree. The right answer is to declare the writeable paths explicitly, not to disable hardening.
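To confirm that the declared paths really are writable from inside the hardened unit, a small probe can be run in the service's own context, for example from a temporary ExecStartPre= line. That placement is a suggestion, not something the dashboard does today, and the function name probe_writable is hypothetical.

```shell
# Try to create and remove a scratch file in each directory given.
# Returns non-zero and names every directory that rejected the write.
probe_writable() (
    failed=0
    for dir in "$@"; do
        probe="$dir/.write-probe.$$"
        # The subshell keeps a failed redirection from aborting the loop.
        if ( : > "$probe" ) 2>/dev/null; then
            rm -f "$probe"
        else
            echo "not writable: $dir" >&2
            failed=1
        fi
    done
    [ "$failed" -eq 0 ]
)

# Intended use, with the mountpoints from this post:
#   probe_writable /data/backups /mnt/sovereign-usb /mnt/sovereign-usb-media
```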
The architecture as it stands now
Tier A, NVMe nightly. systemd timer fires at 02:00, writes /data/backups/sovereign-backup-YYYYMMDD-HHMMSS.tar.gz.age, 14 days retention. Useful for “I deleted the wrong file yesterday”. Not real disaster recovery (DR), because the NVMe is not physically separable from the DGX Spark.
Tier B, USB rolling. Manual run via dashboard or desktop app, writes to /mnt/sovereign-usb/backups/, 30 days retention. The stick is plugged in for the backup, then unplugged and stored offline. This is the actual disaster-recovery tier.
Encryption. Single age recipient public key. The matching private key (age-identity) lives separately on hardware (BitBox02 paper backup plus an offline copy). The encrypted tar on its own cannot be decrypted without that key.
Logging. /var/log/sovereign-backup.log mode 0644. Currently human-read; the postmortem lesson is that a 25-hour staleness alert would have caught the original silent failure, and that alert is the next concrete addition to the dashboard.
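The staleness check itself is small. The sketch below is one way it could look, assuming the Tier A filename pattern from this post and a find that supports -mmin and -maxdepth (GNU find does); the function name backup_is_fresh is hypothetical and the alerting side is left out.

```shell
# Succeed if DIR holds at least one backup modified within the last
# MAX_HOURS hours.
backup_is_fresh() (
    dir=$1 max_hours=$2
    max_minutes=$((max_hours * 60))
    recent=$(find "$dir" -maxdepth 1 -type f \
        -name 'sovereign-backup-*.tar.gz.age' \
        -mmin "-$max_minutes" 2>/dev/null | head -n 1)
    [ -n "$recent" ]
)

# Intended use from a separate timer or the dashboard:
#   backup_is_fresh /data/backups 25 || echo "no backup in the last 25h" >&2
```

Twenty-five hours rather than twenty-four leaves room for the timer's RandomizedDelaySec and for the run itself.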
Lessons that generalise beyond backups
A green timer status answers “is this scheduled to run?”, not “did the last run do anything useful?”. Any service in this architecture is a candidate for the same class of silent failure: backup, log rotation, certificate renewal, scheduled training jobs, anything that runs on a timer with no end-to-end verification.
The fix that scales is a smoke test that exercises the actual output. For backups: parse the most recent tar and verify the manifest looks right. For cert renewal: call out to the public endpoint and check the cert expiry moved forward. For log rotation: check that yesterday’s log file exists and is non-empty. The smoke test runs at the same cadence as the job and alerts when its assumptions stop holding.
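For the backup case, the manifest check can be as simple as counting entries in the archive listing. The sketch below reads a gzip-compressed tar stream on stdin, so decryption stays outside it. The function names and the threshold in the usage comment are hypothetical, and the decrypt step needs the age identity, which in this setup is kept offline, so this would have to run wherever that key is available.

```shell
# Count entries in a gzip-compressed tar stream read from stdin.
count_tar_entries() {
    tar -tzf - | wc -l
}

# Succeed only if the stream holds at least MIN entries. An unreadable or
# empty stream counts as zero entries.
archive_has_entries() (
    min=$1
    count=$(count_tar_entries 2>/dev/null) || count=0
    [ "$count" -ge "$min" ]
)

# Intended use (identity path and threshold are placeholders):
#   age --decrypt -i /path/to/age-identity "$latest_backup" \
#       | archive_has_entries 100 || echo "latest backup looks empty" >&2
```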
A second lesson, smaller: read your own dashboard’s data sources. The dashboard scraped timer status, not service status. That distinction was buried two clicks into systemd’s documentation; one afternoon of fixing the dashboard to scrape both would have caught Bugs 1 and 2 within the first week.
Status, 2026-05-07
The rebuild landed 2026-04-14. Tier A has produced a tar nightly since then. Tier B has been written manually a handful of times. No silent failures, no FAT32 traps, no symlink-pointing-at-nothing.
What is not yet in place but should be:
- A smoke test that parses the latest tar’s manifest and counts file entries (would catch a future regression to “tar exists but is empty”)
- A 25-hour staleness alert from the dashboard (would catch a future regression to “timer is green but no tar landed”)
- Off-site replication to a second sovereign box (today the desk drawer is the real DR boundary; a fire in the same room is the worst-case loss)
Each of those is one weekend’s work. Writing them down here makes it harder to forget.
What I Actually Use
- age (FiloSottile) for asymmetric file encryption, single recipient model, BitBox02 paper backup of the private key
- systemd timer + ERR-trap + atomic temp-file rename for nightly Tier A
- 256 GB Samsung USB-C, ext4 + exFAT split, plugged in only for Tier B runs
- /var/log/sovereign-backup.log for the audit trail (human-read for now, automated alert is next)