My Backup Ran for Six Weeks Without Backing Anything Up

May 7, 2026 9 min read

For six weeks (the entire life of this DGX Spark so far) systemctl status sovereign-backup.timer showed green. The journal showed clean exits. No errors, no alerts, no missed schedules on the dashboard. Not a single backup tar was ever written. Four independent bugs lined up to produce a system that looked like it was working and was not. This is the postmortem and the rebuild that replaced it.

Where this went: the single-host script below was later generalised and released as sovereign-backup (MIT, pure bash, config-driven, multi-host). Details in the update at the end.

Quick Take

A green systemd timer status reports the schedule, not the job exit

Four silent bugs combined to keep the backup script from ever executing successfully

The fix was a rewrite, not a patch: new service file, USB repartition, atomic writes, ERR-trap

Final architecture: Tier A NVMe nightly (14d), Tier B USB rolling (30d), age-encrypted, single recipient key

Lesson generalises: any service with enabled + active and no end-to-end smoke test is a candidate for the same class of failure

What looked fine

The setup was straightforward, on paper. A systemd timer fires sovereign-backup.service nightly at 02:00. The service runs /usr/local/bin/backup.sh, which tars /data/projects plus /data/secrets plus a few other directories, pipes the stream through age --recipient to encrypt it, writes the tarball to /data/backups/, prunes anything older than 14 days, and exits zero.

systemctl list-timers showed the timer scheduled correctly, last-fired stamps moved nightly, the unit stayed active. The dashboard pulled timer status from the same source and rendered it green. I never set up an alert because the timer never reported a failure to alert on.

The first time I needed to restore a deleted file, the backup directory was empty. Six weeks of empty.

Bug 1: `ExecStart` pointed at a symlink that was never created

The unit file shipped with this:

# /etc/systemd/system/sovereign-backup.service (broken)
[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh

The actual script lived at /data/projects/sovereign-backup/backup.sh. The deploy procedure assumed a symlink at /usr/local/bin/backup.sh that pointed at the real script. The symlink was planned and never made. systemd dutifully tried to run /usr/local/bin/backup.sh every night, failed with No such file or directory (which the service records as status=203/EXEC), and moved on. The timer’s success state reflected only “the timer fired”, not “the service did anything useful”.
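A pre-deploy check along these lines would have caught the missing symlink. This is a minimal sketch, not something that was running at the time: check_execstart is a hypothetical helper, and it assumes simple ExecStart= lines with an absolute path and none of systemd's prefix characters (-, @, + and so on). On a host with systemd, systemd-analyze verify flags the same class of problem.

```shell
# check_execstart UNIT_FILE
# Hypothetical helper: fails if any ExecStart= command in the unit file
# does not resolve to an executable (a dangling symlink counts as missing).
check_execstart() {
  local unit_file=$1 line cmd rc=0
  while IFS= read -r line; do
    case "$line" in
      ExecStart=*)
        cmd=${line#ExecStart=}   # drop the key
        cmd=${cmd%% *}           # keep the first word: the executable path
        if [ ! -x "$cmd" ]; then
          echo "not executable or missing: $cmd" >&2
          rc=1
        fi
        ;;
    esac
  done < "$unit_file"
  return "$rc"
}

# Usage (not run here):
#   check_execstart /etc/systemd/system/sovereign-backup.service || exit 1
```

Running this as the last step of the deploy procedure turns "the symlink was planned" into a hard failure instead of a nightly no-op.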

systemctl status sovereign-backup.timer shows timer health. systemctl status sovereign-backup.service would have shown the failed exits, but the dashboard scraped only the timer.
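The service-side check the dashboard was missing can be sketched as follows. check_service_result is a hypothetical name; it parses the Key=Value lines that systemctl show prints for the Result and ExecMainStatus properties. One caveat: a service that has never run also reports Result=success, so this complements a check on the actual backup file rather than replacing it.

```shell
# check_service_result
# Hypothetical helper: reads "Key=Value" lines on stdin, in the format of
#   systemctl show -p Result -p ExecMainStatus UNIT
# and succeeds only if the last run ended cleanly.
check_service_result() {
  local result="" status="" key value
  while IFS='=' read -r key value; do
    case "$key" in
      Result) result=$value ;;
      ExecMainStatus) status=$value ;;
    esac
  done
  if [ "$result" = "success" ] && [ "$status" = "0" ]; then
    echo "ok"
  else
    echo "failed result=${result:-unknown} status=${status:-unknown}"
    return 1
  fi
}

# Usage on a systemd host (not run here):
#   systemctl show -p Result -p ExecMainStatus sovereign-backup.service \
#     | check_service_result
```

For the broken unit in Bug 1, the input would have been Result=exit-code and ExecMainStatus=203, which this reports as a failure.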

Bug 2: `age` was not installed at all

The backup script’s preflight checked for age:

command -v age >/dev/null || { echo "age not installed" >&2; exit 1; }

The check itself was correct: with age missing from the system, any run that reached the script would exit 1 and write the message to the journal. Nothing read the journal. The dashboard was never wired up to parse journalctl -u sovereign-backup.service output, so it could not have surfaced the preflight failure. Even with Bug 1 found and fixed, the script would still have exited at the preflight without ever encrypting a tar.
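A slightly broader preflight reports every missing dependency in one pass instead of stopping at the first. This is a sketch under the assumption that the script needs tar, pigz and age; require_cmds is a hypothetical helper name.

```shell
# require_cmds CMD...
# Hypothetical helper: lists all missing commands on stderr and returns 1
# if at least one of them is not on PATH.
require_cmds() {
  local missing="" cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="${missing:+$missing }$cmd"
  done
  if [ -n "$missing" ]; then
    echo "missing commands: $missing" >&2
    return 1
  fi
}

# Usage at the top of the backup script (not run here):
#   require_cmds tar pigz age || exit 1
```

The message still only lands in the journal, so it is useful only once something reads the service's exit status or log output.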

Bug 3: key paths in the script did not match where the keys lived

The setup procedure had put the keys in ~/.age-identity and ~/.age-recipient (age-keygen has no default output file: it prints the identity to stdout unless -o is given, so the location is whatever the setup steps choose). As root that resolves to /root/.age-identity and /root/.age-recipient. The backup script referenced /data/secrets/age-identity and /data/secrets/age-recipient. The setup documentation said one path; the implementation referenced another. Even if Bugs 1 and 2 had been fixed, the script would have failed when reading the recipient key.

The fix is structural: the script and the setup doc both reference /data/secrets/age-identity and /data/secrets/age-recipient, the keys are migrated there once, and permissions are tightened to mode 0600.
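To keep the doc and the script from drifting apart again, the script can assert the key path and mode before it does anything else. A minimal sketch, assuming GNU or busybox stat (the -c '%a' form); check_key_file is a hypothetical helper.

```shell
# check_key_file PATH MODE
# Hypothetical helper: fails if PATH is missing or unreadable, or if its
# octal permission bits differ from MODE (for example 600).
check_key_file() {
  local path=$1 want_mode=$2 mode
  if [ ! -r "$path" ]; then
    echo "unreadable or missing: $path" >&2
    return 1
  fi
  mode=$(stat -c '%a' "$path")
  if [ "$mode" != "$want_mode" ]; then
    echo "$path has mode $mode, expected $want_mode" >&2
    return 1
  fi
}

# Usage (not run here):
#   check_key_file /data/secrets/age-recipient 600 || exit 1
```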

Bug 4: the USB stick was FAT32

The first USB stick I plugged in for Tier B was FAT32, the factory format on most consumer sticks. The compressed Sparky tar comes in around 1.2 GB on a quiet day. Day 1 backup: 1.2 GB, fits. Day 2: another 1.2 GB. By Day 3 or 4 (three quiet days add up to about 3.6 GB, so one busier day is enough) the running combined archive crossed the 4 GB FAT32 file-size limit and tar failed with File too large. tar exited non-zero, age never got input, and no encrypted file landed on the USB.

This bug had a different signature from the other three: it produced a real, visible error in the journal, but only after several days of nominally working runs. It would have been caught by an end-to-end smoke test that wrote a synthetic large file and read it back. There was no such test.
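The smoke test that did not exist could look roughly like this. It is a sketch, with probe_large_file as a hypothetical helper: it writes a zero-filled file of the requested size with dd, reads it back to confirm the byte count, and cleans up. To exercise the FAT32 limit the size has to exceed 4096 MiB, which takes a while on a USB stick, so this is a one-off check when a new stick is introduced, not a nightly job.

```shell
# probe_large_file DIR SIZE_MIB
# Hypothetical helper: writes SIZE_MIB mebibytes of zeros into DIR, reads
# the file back to count its bytes, removes it, and fails on any mismatch.
probe_large_file() {
  local dir=$1 size_mib=$2 probe expected actual
  probe="$dir/.backup-probe.$$"
  expected=$((size_mib * 1024 * 1024))
  if ! dd if=/dev/zero of="$probe" bs=1M count="$size_mib" 2>/dev/null; then
    rm -f "$probe"
    echo "write of ${size_mib} MiB failed in $dir" >&2
    return 1
  fi
  actual=$(wc -c < "$probe")
  rm -f "$probe"
  if [ "$actual" -ne "$expected" ]; then
    echo "size mismatch in $dir: wrote $expected bytes, read $actual" >&2
    return 1
  fi
}

# Usage (not run here; 5000 MiB is above the FAT32 per-file limit):
#   probe_large_file /mnt/sovereign-usb 5000 || exit 1
```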

The rebuild

Once it was clear the failure was structural, not patchable, I rewrote the backup system end-to-end. The new shape:

# /etc/systemd/system/sovereign-backup.service (corrected)
[Unit]
Description=Sovereign nightly local backup
After=network-online.target

[Service]
Type=oneshot
ExecStart=/data/projects/sovereign-backup/backup.sh
ProtectSystem=strict
ReadWritePaths=/data/backups /var/log
NoNewPrivileges=true
StandardOutput=journal
StandardError=journal

# /etc/systemd/system/sovereign-backup.timer
[Unit]
Description=Run nightly local backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=15min

[Install]
WantedBy=timers.target

The script fix:

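#!/usr/bin/env bash
# /data/projects/sovereign-backup/backup.sh: atomic write + ERR trap
set -euo pipefail
RECIPIENT_FILE=/data/secrets/age-recipient
DEST_DIR=${1:-/data/backups}
NAME="sovereign-backup-$(date +%Y%m%d-%H%M%S).tar.gz.age"
FINAL_FILE="${DEST_DIR}/${NAME}"
TMP_FILE="${FINAL_FILE}.tmp"

# On any failing command, remove the partial file and leave a journal entry.
trap 'rm -f "$TMP_FILE"; logger -t sovereign-backup "aborted, tmp removed"' ERR

# tar -> pigz -> age; with pipefail, a failure in any stage fails the pipeline.
# age's flag for a file of recipients is --recipients-file (short form: -R).
tar --exclude='node_modules' --exclude='.cache' \
    -cf - /data/projects /data/secrets /data/gitea \
  | pigz -c \
  | age --recipients-file "$RECIPIENT_FILE" --output "$TMP_FILE"

# A rename within one filesystem is atomic: the final name only ever holds a complete file.
mv "$TMP_FILE" "$FINAL_FILE"
trap - ERR

# Retention prune: regular files directly in DEST_DIR, last modified more than 14 days ago.
find "$DEST_DIR" -maxdepth 1 -type f -name 'sovereign-backup-*.tar.gz.age' -mtime +14 -delete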

The atomic temp-file plus ERR-trap pattern means a partial tar never lands at the canonical filename. Either the full encrypted tarball moves into place, or nothing does, plus a journal entry that the dashboard now actually scrapes.

The USB stick, repartitioned

A 256 GB Samsung USB-C stick split into two partitions:

Partition	Size	Format	Mount
sdb1	40 GB	ext4	`/mnt/sovereign-usb` (encrypted backups)
sdb2	~199 GB	exFAT	`/mnt/sovereign-usb-media` (cross-platform media)

ext4 for the backup partition: no file-size cap, supports POSIX permissions and atomic rename. exFAT for the media partition: macOS, Windows, Android can read it, no 4 GB limit. The media partition was a side-benefit, not part of the backup story; it just happens to use the same physical stick.
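A cheap guard against plugging in the wrong stick is to check the filesystem type of the target before writing. This is a sketch only: fstype_ok_for_backups is a hypothetical helper, and the allow-list (ext4, xfs, btrfs) is my assumption about which filesystems offer large files, POSIX permissions and atomic rename. FAT32 shows up as vfat and is rejected, as is exFAT, which has no 4 GB cap but no POSIX permissions either.

```shell
# fstype_ok_for_backups FSTYPE
# Hypothetical helper: succeeds only for filesystem types on the allow-list.
fstype_ok_for_backups() {
  case "$1" in
    ext4|xfs|btrfs) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (not run here; findmnt is part of util-linux):
#   fstype=$(findmnt -n -o FSTYPE --target /mnt/sovereign-usb)
#   fstype_ok_for_backups "$fstype" || { echo "refusing: $fstype" >&2; exit 1; }
```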

The fifth bug, found while testing the rebuild

After fixing the four original bugs and rebuilding, the dashboard’s “Backup to USB” button still failed. The dashboard service had ProtectSystem=strict. The button shelled out to sudo backup-to-usb.sh, which tried to write into /mnt/sovereign-usb. systemd’s hardening blocked the write with a read-only file system error. Adding the USB mountpoints to ReadWritePaths fixed it:

# Dashboard service unit
ReadWritePaths=/data /var/log /var/lib/tor /var/lib/aide \
               /mnt/sovereign-usb /mnt/sovereign-usb-media

systemd hardening is a footgun when it interacts with services that shell out to scripts touching paths outside their protected tree. The right answer is to declare the writeable paths explicitly, not to disable hardening.
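One way to make that declaration checkable is a small lint that compares the paths a script writes to against the unit's ReadWritePaths list. The sketch below is an illustration, not part of the current setup: path_is_declared_writable is a hypothetical helper, it does plain prefix matching on path components, and it does not handle the optional - or + prefixes that systemd allows on these entries.

```shell
# path_is_declared_writable PATH PREFIX...
# Hypothetical helper: succeeds if PATH equals one of the PREFIX entries or
# lies below it. Matching is per path component, so /mnt/sovereign-usb does
# not cover /mnt/sovereign-usb-media.
path_is_declared_writable() {
  local path=$1 prefix
  shift
  for prefix in "$@"; do
    prefix=${prefix%/}
    case "$path/" in
      "$prefix"/*) return 0 ;;
    esac
  done
  return 1
}

# Usage with the original, too-narrow list (not run here):
#   path_is_declared_writable /mnt/sovereign-usb/backups \
#     /data /var/log /var/lib/tor /var/lib/aide \
#     || echo "not writable under ProtectSystem=strict" >&2
```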

The architecture as it stands now

Tier A, NVMe nightly. systemd timer fires at 02:00, writes /data/backups/sovereign-backup-YYYYMMDD-HHMMSS.tar.gz.age, 14 days retention. Useful for “I deleted the wrong file yesterday”. Not real DR because the NVMe is not physically separable from the DGX Spark.

Tier B, USB rolling. Manual run via dashboard or desktop app, writes to /mnt/sovereign-usb/backups/, 30 days retention. The stick is plugged in for the backup, then unplugged and stored offline. This is the actual disaster-recovery tier.

Encryption. Single age recipient public key. The matching private key (age-identity) lives separately on hardware (BitBox02 paper backup plus an offline copy). The encrypted tar on its own cannot be decrypted without that key. (BitBox02 is an affiliate link. You support sovgrid at no extra cost to you. See /support.)

Logging. /var/log/sovereign-backup.log mode 0644. Currently human-read; the postmortem lesson is that a 25-hour staleness alert would have caught the original silent failure, and that alert is the next concrete addition to the dashboard.
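The staleness check itself is small. A minimal sketch, assuming the Tier A naming scheme above and a find that supports -mmin (GNU and busybox do); backup_is_fresh is a hypothetical helper, and 1500 minutes is the 25-hour threshold.

```shell
# backup_is_fresh DIR [MAX_AGE_MIN]
# Hypothetical helper: succeeds if DIR holds at least one non-empty backup
# file modified within the last MAX_AGE_MIN minutes (default 1500 = 25 h).
backup_is_fresh() {
  local dir=$1 max_age_min=${2:-1500}
  find "$dir" -maxdepth 1 -type f -size +0 \
       -name 'sovereign-backup-*.tar.gz.age' \
       -mmin "-$max_age_min" 2>/dev/null | grep -q .
}

# Usage from a dashboard probe or a second timer (not run here):
#   backup_is_fresh /data/backups || logger -t sovereign-backup "STALE: no backup in 25h"
```

The check looks only at the output directory, so it would have fired regardless of which of the four bugs was responsible.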

Lessons that generalise beyond backups

A green timer status answers “is this scheduled to run?”, not “did the last run do anything useful?”. Any service in this architecture is a candidate for the same class of silent failure: backup, log rotation, certificate renewal, scheduled training jobs, anything that runs on a timer with no end-to-end verification.

The fix that scales is a smoke test that exercises the actual output. For backups: parse the most recent tar and verify the manifest looks right. For cert renewal: call out to the public endpoint and check the cert expiry moved forward. For log rotation: check that yesterday’s log file exists and is non-empty. The smoke test runs at the same cadence as the job and alerts when its assumptions stop holding.
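For the backup case, the core of such a smoke test is counting the entries in the decrypted archive. The sketch below assumes the tar | pigz | age pipeline used in the rebuild; verify_archive_stream is a hypothetical helper that reads the already-decrypted gzip stream on stdin. Decryption needs the age identity, which in this setup is kept off the machine, so where this test can run is an open design question, not something the sketch settles.

```shell
# verify_archive_stream [MIN_ENTRIES]
# Hypothetical helper: reads a gzip-compressed tar on stdin and succeeds if
# it lists at least MIN_ENTRIES members (default 1). An empty, truncated or
# non-tar stream yields too few entries and fails the check.
verify_archive_stream() {
  local min_entries=${1:-1} count
  count=$(gzip -dc | tar -tf - | wc -l | tr -d ' ')
  [ "$count" -ge "$min_entries" ]
}

# Usage where the identity is available (not run here):
#   age --decrypt -i /path/to/age-identity "$latest_backup" \
#     | verify_archive_stream 1000 || echo "backup looks empty" >&2
```

The threshold of 1000 in the usage line is a placeholder; a sensible value depends on how many files the source directories normally hold.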

A second lesson, smaller: read your own dashboard’s data sources. The dashboard scraped timer status, not service status. That distinction was buried two clicks into systemd’s documentation; one afternoon of fixing the dashboard to scrape both would have caught Bugs 1 and 2 within the first week.

Status, 2026-05-07

The rebuild landed 2026-04-14. Tier A has produced a tar nightly since then. Tier B has been written manually a handful of times. No silent failures, no FAT32 traps, no symlink-pointing-at-nothing.

What is not yet in place but should be:

A smoke test that parses the latest tar’s manifest and counts file entries (would catch a future regression to “tar exists but is empty”)
A 25-hour staleness alert from the dashboard (would catch a future regression to “timer is green but no tar landed”)
Off-site replication to a second sovereign box (today the desk drawer is the real DR boundary; a fire in the same room is the worst-case loss)

Each of those is one weekend’s work. Writing them down here makes it harder to forget.

What I Actually Use

age (FiloSottile) for asymmetric file encryption, single recipient model, BitBox02 paper backup of the private key

systemd timer + ERR-trap + atomic temp-file rename for nightly Tier A

256 GB Samsung USB-C, ext4 + exFAT split, plugged in only for Tier B runs

/var/log/sovereign-backup.log for the audit trail (human-read for now, automated alert is next)

Update (2026-06-10): this became an open-source tool

The single-host script in this postmortem was later generalised and released as sovereign-backup (MIT, pure bash). The hardcoded source list and the single age recipient moved into per-host YAML, so one generic bin/sovereign-backup runs across several machines instead of a script edited per box. It ships with a smoke-tested CLI (--dry-run, deterministic exit codes, restore --verify) that exercises the encrypt-and-restore path in CI before you trust it. The three-tier shape and the age single-recipient encryption model described above are unchanged.

Flow

Four silent bugs, one green timer: backup-system postmortem and the rebuilt three-tier architecture

Bug 1: ExecStart pointed at a non-existent symlink

Bug 2: age binary not installed, preflight failed silently

Bug 3: Key paths in script differed from where keys actually lived

Bug 4: USB stick formatted FAT32, 4 GB file-size limit

Hidden 5th: systemd ProtectSystem=strict blocked the dashboard write path
