Green systemd timer, healthy logs, four silent bugs. The backup script never executed once. Here is the postmortem and the rebuilt three-tier architecture that replaced it.

My Backup Ran for Six Weeks Without Backing Anything Up

For six weeks (the entire life of this DGX Spark so far) systemctl status sovereign-backup.timer showed green. The journal showed clean exits. No errors, no alerts, no missed schedules on the dashboard. Not a single backup tar was ever written. Four independent bugs lined up to produce a system that looked like it was working and was not. This is the postmortem and the rebuild that replaced it.

Quick Take

  • A green systemd timer status reports the schedule, not the job exit
  • Four silent bugs combined to keep the backup script from ever executing successfully
  • The fix was a rewrite, not a patch: new service file, USB repartition, atomic writes, ERR-trap
  • Final architecture: Tier A NVMe nightly (14d), Tier B USB rolling (30d), age-encrypted, single recipient key
  • The lesson generalises: any unit that shows enabled and active but has no end-to-end smoke test is a candidate for the same class of failure

What looked fine

The setup was straightforward, on paper. A systemd timer fires sovereign-backup.service nightly at 02:00. The service runs /usr/local/bin/backup.sh, which tars /data/projects plus /data/secrets plus a few other directories, pipes through age --recipient to encrypt, writes the tarball to /data/backups/, prunes anything older than 14 days, exits zero.

systemctl list-timers showed the timer scheduled correctly, last-fired stamps moved nightly, the unit stayed active. The dashboard pulled timer status from the same source and rendered it green. I never set up an alert because the timer never reported a failure to alert on.

The first time I needed to restore a deleted file, the backup directory was empty. Six weeks of empty.

Bug 1: ExecStart pointed at a non-existent symlink

The unit file shipped with this:

# /etc/systemd/system/sovereign-backup.service (broken)
[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh

The actual script lived at /data/projects/sovereign-backup/backup.sh. The deploy procedure assumed a symlink at /usr/local/bin/backup.sh pointing at the real script. The symlink was planned and never created. systemd dutifully tried to run /usr/local/bin/backup.sh every night, failed with No such file or directory (recorded on the service as status=203/EXEC), and moved on. The timer’s success state reflected only “the timer fired”, not “the service did anything useful”.

systemctl status sovereign-backup.timer shows timer health. systemctl status sovereign-backup.service would have shown the failed exits, but the dashboard scraped only the timer.
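
One way to make a dashboard look at the job rather than the schedule is to read the service's last result instead of the timer state. The following is a minimal sketch, not the dashboard's actual code: unit_result_ok is a hypothetical helper name, and it reads KEY=VALUE lines on stdin so it can be tried without systemd. Result and ExecMainStatus are properties that systemctl show exposes for service units.

```shell
#!/usr/bin/env bash
# unit_result_ok: hypothetical helper, not part of systemd.
# Reads `systemctl show` style KEY=VALUE lines on stdin and returns 0 only
# if the unit's last run reported Result=success and ExecMainStatus=0.
unit_result_ok() {
  local line result="" status=""
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      Result=*)         result=${line#Result=} ;;
      ExecMainStatus=*) status=${line#ExecMainStatus=} ;;
    esac
  done
  [ "$result" = "success" ] && [ "$status" = "0" ]
}

# Intended use on the host (not executed here):
#   systemctl show sovereign-backup.service -p Result -p ExecMainStatus \
#     | unit_result_ok || echo "last backup run failed" >&2
```

A unit that has never been started typically reports Result=success as well, so this check is worth pairing with a check on the age of the newest archive.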

Bug 2: age was not installed at all

The backup script’s preflight checked for age:

command -v age >/dev/null || { echo "age not installed" >&2; exit 1; }

The check itself was correct: with age missing from the system, the script exits 1 and the message goes to the journal. Nothing read the journal. The dashboard was never wired up to parse journalctl -u sovereign-backup.service output, so it would not have surfaced the preflight failure. Even after Bug 1 was found and fixed, the script would still have exited at the preflight without ever encrypting a tar.

Bug 3: key paths in the script did not match where the keys lived

The keys had been generated as root into /root/.age-identity and /root/.age-recipient. (age-keygen has no default output path: it prints the identity to standard output unless -o names a file, so the location is whatever was chosen at generation time.) The backup script referenced /data/secrets/age-identity and /data/secrets/age-recipient. The setup documentation said one path; the implementation referenced another. Even if Bugs 1 and 2 had been fixed, the script would have failed when reading the recipient key.

The fix is structural: the script and the setup doc both reference /data/secrets/age-identity and /data/secrets/age-recipient, the keys are migrated there once, and permissions are tightened to 0600 (chmod 600).
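
A path mismatch like this can be turned into a loud preflight failure. The sketch below is an illustration under assumptions, not the script's actual preflight: key_file_ok is a hypothetical helper name, and it relies on GNU stat (the -c option), as found on typical Linux systems.

```shell
#!/usr/bin/env bash
# key_file_ok PATH
# Hypothetical helper: succeeds only if PATH exists, is non-empty,
# and has exactly mode 600. Requires GNU stat for the -c option.
key_file_ok() {
  local path=$1 mode
  [ -s "$path" ] || { echo "missing or empty: $path" >&2; return 1; }
  mode=$(stat -c '%a' "$path") || return 1
  [ "$mode" = "600" ] || { echo "mode $mode on $path, expected 600" >&2; return 1; }
}

# Intended use near the top of backup.sh (not executed here):
#   key_file_ok /data/secrets/age-recipient || exit 1
```

Checking the exact path the script will later pass to age means a doc/implementation drift fails at the first run instead of at the first restore.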

Bug 4: the USB stick was FAT32

The first USB stick I plugged in for Tier B was FAT32, the factory format on most consumer sticks. Compressed Sparky tar comes in around 1.2 GB on a quiet day. Day 1 backup: 1.2 GB, fits. Day 2: another 1.2 GB. Day 3: the running combined archive crossed the 4 GB FAT32 file-size limit and tar failed with File too large. Tar exited non-zero, age never got input, no encrypted file landed on the USB.

This bug had a different signature from the other three: it produced a real, visible error in the journal, but only after several days of nominally working runs. It would have been caught by an end-to-end smoke test that wrote a synthetic large file and read it back. There was no such test.
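
A write probe of that kind can be sketched in a few lines. This is an assumption-laden illustration rather than a test that existed: probe_large_write is a hypothetical helper name, and it uses GNU dd and GNU stat. It writes zeros into the target directory, compares the size read back from the filesystem, and cleans up.

```shell
#!/usr/bin/env bash
# probe_large_write DIR SIZE_MIB
# Hypothetical helper: writes SIZE_MIB MiB of zeros into DIR, checks that the
# file size reported by the filesystem matches, then removes the probe file.
probe_large_write() {
  local dir=$1 size_mib=$2 probe actual expected
  probe=$(mktemp "$dir/.write-probe.XXXXXX") || return 1
  expected=$(( size_mib * 1024 * 1024 ))
  if ! dd if=/dev/zero of="$probe" bs=1M count="$size_mib" status=none; then
    rm -f "$probe"
    return 1
  fi
  actual=$(stat -c '%s' "$probe")
  rm -f "$probe"
  [ "$actual" = "$expected" ]
}

# Intended use against the USB mount (not executed here). 5120 MiB is above
# the FAT32 per-file limit, so the probe fails on a FAT32 stick:
#   probe_large_write /mnt/sovereign-usb 5120 || echo "large-file probe failed" >&2
```

The probe needs as much free space as the size it writes, so it suits an occasional check (for example when a new stick is first mounted) more than a nightly one.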

The rebuild

Once it was clear the failure was structural, not patchable, I rewrote the backup system end-to-end. The new shape:

# /etc/systemd/system/sovereign-backup.service (corrected)
[Unit]
Description=Sovereign nightly local backup
After=network-online.target

[Service]
Type=oneshot
ExecStart=/data/projects/sovereign-backup/backup.sh
ProtectSystem=strict
ReadWritePaths=/data/backups /var/log
NoNewPrivileges=true
StandardOutput=journal
StandardError=journal

# /etc/systemd/system/sovereign-backup.timer
[Unit]
Description=Run nightly local backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=15min

[Install]
WantedBy=timers.target

The script fix:

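#!/usr/bin/env bash
# /data/projects/sovereign-backup/backup.sh: atomic write + ERR trap
# The shebang must be the first line: systemd executes the file directly,
# and "set -o pipefail" needs bash rather than plain sh.
set -euo pipefail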
RECIPIENT_FILE=/data/secrets/age-recipient
DEST_DIR=${1:-/data/backups}
NAME="sovereign-backup-$(date +%Y%m%d-%H%M%S).tar.gz.age"
FINAL_FILE="${DEST_DIR}/${NAME}"
TMP_FILE="${FINAL_FILE}.tmp"

trap 'rm -f "$TMP_FILE"; logger -t sovereign-backup "aborted, tmp removed"' ERR

tar --exclude='node_modules' --exclude='.cache' \
    -cf - /data/projects /data/secrets /data/gitea \
  | pigz -c \
  | age --recipient-file "$RECIPIENT_FILE" --output "$TMP_FILE"

mv "$TMP_FILE" "$FINAL_FILE"
trap - ERR

# Retention prune
find "$DEST_DIR" -name 'sovereign-backup-*.tar.gz.age' -mtime +14 -delete

The atomic temp-file plus ERR-trap pattern means a partial tar never lands at the canonical filename. Either the full encrypted tarball moves into place, or nothing does and the abort is recorded in a journal entry that the dashboard now actually scrapes.

The USB stick, repartitioned

A 256 GB Samsung USB-C stick split into two partitions:

Partition  Size     Format  Mount
sdb1       40 GB    ext4    /mnt/sovereign-usb (encrypted backups)
sdb2       ~199 GB  exFAT   /mnt/sovereign-usb-media (cross-platform media)

ext4 for the backup partition: no file-size cap, supports POSIX permissions and atomic rename. exFAT for the media partition: macOS, Windows, Android can read it, no 4 GB limit. The media partition was a side-benefit, not part of the backup story; it just happens to use the same physical stick.
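
The filesystem choice can also be enforced before a Tier B run writes anything. The sketch below is hypothetical: fstype_ok_for_backup is an illustrative helper name and the allow-list is an assumption, not part of the actual scripts. It takes the filesystem type as a string, so anything not explicitly allowed (FAT variants, or an empty value when nothing is mounted) is refused.

```shell
#!/usr/bin/env bash
# fstype_ok_for_backup FSTYPE
# Hypothetical helper: accepts a short allow-list of filesystems without a
# 4 GiB per-file cap and rejects everything else, including an empty string.
fstype_ok_for_backup() {
  case "$1" in
    ext4|xfs|btrfs) return 0 ;;
    *)              return 1 ;;
  esac
}

# Intended use before writing to the stick (not executed here). With
# --mountpoint, findmnt prints nothing if the stick is not mounted, so an
# unplugged stick is refused too:
#   fstype=$(findmnt -n -o FSTYPE --mountpoint /mnt/sovereign-usb)
#   fstype_ok_for_backup "$fstype" || { echo "refusing to write to '$fstype'" >&2; exit 1; }
```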

The fifth bug, found while testing the rebuild

After fixing the four original bugs and rebuilding, the dashboard’s “Backup to USB” button still failed. The dashboard service had ProtectSystem=strict. The button shelled out to sudo backup-to-usb.sh, which tried to write into /mnt/sovereign-usb. systemd’s hardening blocked the write with a read-only file system error. Adding the USB mountpoints to ReadWritePaths fixed it:

# Dashboard service unit
ReadWritePaths=/data /var/log /var/lib/tor /var/lib/aide \
               /mnt/sovereign-usb /mnt/sovereign-usb-media

systemd hardening is a footgun when it interacts with services that shell out to scripts touching paths outside their protected tree. The right answer is to declare the writeable paths explicitly, not to disable hardening.

The architecture as it stands now

Tier A, NVMe nightly. systemd timer fires at 02:00, writes /data/backups/sovereign-backup-YYYYMMDD-HHMMSS.tar.gz.age, 14 days retention. Useful for “I deleted the wrong file yesterday”. Not real DR because the NVMe is not physically separable from the DGX Spark.

Tier B, USB rolling. Manual run via dashboard or desktop app, writes to /mnt/sovereign-usb/backups/, 30 days retention. The stick is plugged in for the backup, then unplugged and stored offline. This is the actual disaster-recovery tier.

Encryption. Single age recipient public key. The matching private key (age-identity) lives separately on hardware (BitBox02 paper backup plus an offline copy). The encrypted tar on its own cannot be decrypted without that key.

Logging. /var/log/sovereign-backup.log mode 0644. Currently human-read; the postmortem lesson is that a 25-hour staleness alert would have caught the original silent failure, and that alert is the next concrete addition to the dashboard.
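
The staleness check itself is small. A minimal sketch, assuming GNU find and the archive naming used by the backup script; backup_is_fresh is a hypothetical helper name and the alerting side is left out:

```shell
#!/usr/bin/env bash
# backup_is_fresh DIR [MAX_AGE_MINUTES]
# Hypothetical helper: succeeds if DIR holds at least one finished backup
# archive modified within the last MAX_AGE_MINUTES (default 1500 = 25 hours).
# In-progress .tmp files do not match the name pattern and are ignored.
backup_is_fresh() {
  local dir=$1 max_age_min=${2:-1500} newest
  newest=$(find "$dir" -maxdepth 1 -type f \
             -name 'sovereign-backup-*.tar.gz.age' \
             -mmin "-$max_age_min" -print -quit 2>/dev/null)
  [ -n "$newest" ]
}

# Intended use from a separate timer or the dashboard (not executed here):
#   backup_is_fresh /data/backups || echo "no backup in the last 25 hours" >&2
```

Because it inspects the output directory rather than any unit state, this check fails in the same way whether the job never ran, exited early, or wrote somewhere else.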

Lessons that generalise beyond backups

A green timer status answers “is this scheduled to run?”, not “did the last run do anything useful?”. Any service in this architecture is a candidate for the same class of silent failure: backup, log rotation, certificate renewal, scheduled training jobs, anything that runs on a timer with no end-to-end verification.

The fix that scales is a smoke test that exercises the actual output. For backups: parse the most recent tar and verify the manifest looks right. For cert renewal: call out to the public endpoint and check the cert expiry moved forward. For log rotation: check that yesterday’s log file exists and is non-empty. The smoke test runs at the same cadence as the job and alerts when its assumptions stop holding.
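
For the backup case, a manifest check could look roughly like this. It is a sketch under assumptions: archive_has_paths is a hypothetical helper name, and it operates on a plain gzip-compressed tar. For the age-encrypted archives the listing would have to come from something like age --decrypt -i IDENTITY_FILE ARCHIVE | tar -tzf -, which needs the private key, so in this setup it fits a periodic restore drill rather than an unattended nightly job.

```shell
#!/usr/bin/env bash
# archive_has_paths ARCHIVE PREFIX...
# Hypothetical helper: lists a gzip-compressed tar once and succeeds only if
# every PREFIX is the start of at least one member name. tar strips the
# leading "/" when archiving, so prefixes look like "data/projects".
archive_has_paths() {
  local archive=$1 listing prefix
  shift
  listing=$(tar -tzf "$archive") || return 1
  for prefix in "$@"; do
    printf '%s\n' "$listing" |
      awk -v p="$prefix" 'index($0, p) == 1 { found = 1 } END { exit !found }' || {
        echo "missing from archive: $prefix" >&2
        return 1
      }
  done
}

# Intended use on a decrypted copy (not executed here):
#   archive_has_paths restored.tar.gz data/projects data/secrets data/gitea
```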

A second lesson, smaller: read your own dashboard’s data sources. The dashboard scraped timer status, not service status. That distinction was buried two clicks into systemd’s documentation; one afternoon of fixing the dashboard to scrape both would have caught Bugs 1 and 2 within the first week.

Status, 2026-05-07

The rebuild landed 2026-04-14. Tier A has produced a tar nightly since then. Tier B has been written manually a handful of times. No silent failures, no FAT32 traps, no symlink-pointing-at-nothing.

What is not yet in place but should be:

Each of those is one weekend’s work. Writing them down here makes it harder to forget.

What I Actually Use

  • age (FiloSottile) for asymmetric file encryption, single recipient model, BitBox02 paper backup of the private key
  • systemd timer + ERR-trap + atomic temp-file rename for nightly Tier A
  • 256 GB Samsung USB-C, ext4 + exFAT split, plugged in only for Tier B runs
  • /var/log/sovereign-backup.log for the audit trail (human-read for now, automated alert is next)

[Illustration: “Four silent bugs, one green timer”, a flow graphic of the postmortem listing Bug 1 (ExecStart pointed at a non-existent symlink), Bug 2 (age binary not installed), Bug 3 (key paths in the script differed from where the keys lived), Bug 4 (USB stick formatted FAT32, 4 GB file-size limit), and the hidden fifth (systemd ProtectSystem=strict blocked the dashboard write path).]