A senior engineer walks through four hidden failures that made their backup system look healthy while actually doing nothing. Exact commands, real error messages, and the fixes that finally worked.

Backup System Rebuilt from Scratch: The Night I Found Out Six Months of Backups Were Fake

My backup system looked perfect for six months. systemctl status sovereign-backup.timer showed green. The timer was enabled. The service was active. But when I tried to restore a file last week, I realized nothing had ever been backed up.

Quick Take

  • Systemd timer showed green but never actually ran the backup script
  • age encryption tool wasn’t installed, failing silently in the background
  • Keys lived in /root/.age-identity but script expected /data/secrets/age-identity
  • USB stick formatted as FAT32 would have failed at 3.4 GB due to 4 GB file limit

The Invisible Failure: Systemd Timer vs Actual Execution

ExecStart=/usr/local/bin/backup.sh

The service file pointed to /usr/local/bin/backup.sh, which didn’t exist. The real script lived at /data/projects/sovereign-backup/backup.sh. I had planned to symlink it but never did.

ls -l /usr/local/bin/backup.sh
# ls: cannot access '/usr/local/bin/backup.sh': No such file or directory

The timer showed active because a timer unit’s status only tells you that the timer is loaded, enabled, and waiting for its next trigger. It says nothing about whether the service it starts can find its script or exits successfully.

systemctl status sovereign-backup.timer
# ● sovereign-backup.timer - Daily Sovereign AI Backups
#    Loaded: loaded (/etc/systemd/system/sovereign-backup.timer; enabled; vendor preset: enabled)
#    Active: active (waiting) since ...

The real failure only showed up in the service logs.

journalctl -u sovereign-backup.service --no-pager | grep -iE "failed|error"
# sovereign-backup.service: Failed at step EXEC spawning /usr/local/bin/backup.sh: No such file or directory
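
One way to catch this class of failure is to ask systemd about the service the timer triggers rather than the timer itself. The following is a minimal sketch, not part of the original setup: unit_run_ok is a hypothetical helper name, and it assumes systemctl show prints the Result and ExecMainStatus properties as Key=Value lines.

```shell
#!/usr/bin/env bash
# Sketch: judge a service by its last run instead of by its timer.
# unit_run_ok is a hypothetical helper. It reads "Key=Value" lines on stdin,
# such as those printed by:
#   systemctl show -p Result -p ExecMainStatus sovereign-backup.service
unit_run_ok() {
    local line result="" status=""
    while IFS= read -r line; do
        case $line in
            Result=*)         result=${line#Result=} ;;
            ExecMainStatus=*) status=${line#ExecMainStatus=} ;;
        esac
    done
    # Healthy only if the last run ended with Result=success and exit status 0.
    [ "$result" = "success" ] && [ "$status" = "0" ]
}
```

Typical use would be piping the systemctl show output into unit_run_ok and alerting when it returns non-zero. A unit that has never started can also report Result=success, so this is best paired with a check on the backup files themselves.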

Why Silent Failures Happen: Preflight Checks That Don’t Alert

The backup script included a preflight check for age encryption. On the machine where I wrote the script, the check passed:

command -v age
# /usr/bin/age

So age was installed, but on the wrong machine. The backup ran on a different host, where age wasn’t present:

# On the target host:
which age
# /usr/bin/which: no age in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)

The script exited with an error code but didn’t trigger any alert because I never set up notifications for failed preflight checks.

grep "Preflight check failed" /var/log/sovereign-backup.log
# 2024-04-14T02:00:01Z sovereign-backup[1234]: Preflight check failed: age not found
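
A preflight check is only useful if its failure reaches someone. Here is a minimal sketch of a check that reports before it fails. Both function names are hypothetical, and notify_failure only writes to stderr here; in practice its body would be replaced with whatever alerting channel is available.

```shell
#!/usr/bin/env bash
# Sketch: a preflight check that raises an alert instead of only exiting.
# notify_failure is a hypothetical hook; swap its body for mail, a webhook, etc.
notify_failure() {
    printf 'ALERT: %s\n' "$1" >&2
}

# require_cmd NAME: succeed if NAME is on PATH, otherwise alert and fail.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        notify_failure "Preflight check failed: $1 not found"
        return 1
    fi
}
```

Near the top of the backup script this would read require_cmd age || exit 1, so a missing tool produces an alert and a non-zero exit in the same step.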

The Fix: Making Backups Actually Happen

First, point the service to the real script.

# sovereign-backup.service (fixed):
ExecStart=/data/projects/sovereign-backup/backup.sh
ProtectSystem=strict
ReadWritePaths=/data/backups /var/log
NoNewPrivileges=true

Then install age and copy the keys to the location the script expects.

apt install age
cp /root/.age-identity /data/secrets/age-identity
cp /root/.age-recipient /data/secrets/age-recipient
chmod 600 /data/secrets/age-identity
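
Since the key path had already drifted once, it seems worth verifying the result instead of assuming the copy worked. This is a sketch under my own assumptions: check_key_file is a hypothetical helper, and it relies on GNU stat for the permission bits.

```shell
#!/usr/bin/env bash
# check_key_file PATH: succeed only if PATH is a non-empty regular file
# whose permission bits are exactly 600. Relies on GNU stat (-c '%a').
check_key_file() {
    local path=$1 mode
    [ -f "$path" ] && [ -s "$path" ] || return 1
    mode=$(stat -c '%a' -- "$path") || return 1
    [ "$mode" = "600" ]
}
```

Running check_key_file /data/secrets/age-identity as part of the preflight would have flagged the missing key on the first night.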

For the USB stick, repartition and reformat to avoid FAT32’s 4 GB per-file limit. This erases everything on the stick, so confirm the device name with lsblk first.

lsblk
# NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
# sdb      8:16   1 238.4G  0 disk
# └─sdb1   8:17   1 238.4G  0 part

sudo parted /dev/sdb
(parted) mklabel gpt
(parted) mkpart sovereign-backup ext4 1MiB 40GiB
(parted) mkpart sovereign-media exfat 40GiB 100%
(parted) quit

sudo mkfs.ext4 /dev/sdb1
sudo mkfs.exfat /dev/sdb2
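
The FAT32 limit is easy to check with plain arithmetic: the filesystem stores each file size in a 32-bit field, so a single file tops out at 4 GiB minus one byte. A 3.4 GB archive still fits, which is why the problem stays hidden until the archive grows. The sketch below shows the comparison; fits_fat32 is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# FAT32 keeps file sizes in a 32-bit field, so the largest possible file
# is 4 GiB minus 1 byte.
FAT32_MAX_BYTES=$(( 4 * 1024 * 1024 * 1024 - 1 ))   # 4294967295

# fits_fat32 SIZE_BYTES: succeed if a single file of that size fits on FAT32.
fits_fat32() {
    [ "$1" -le "$FAT32_MAX_BYTES" ]
}
```

In a preflight, the size could come from stat -c %s on the previous archive, and findmnt -n -o FSTYPE --target /mnt/sovereign-usb reports the filesystem type of the target mount (vfat for FAT32).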

Add the mount points to the service’s ReadWritePaths, since ProtectSystem=strict leaves everything else read-only.

ReadWritePaths=/data /var/log /var/lib/tor /var/lib/aide /mnt/sovereign-usb /mnt/sovereign-usb-media

Finally, use atomic writes to prevent partial backups.

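set -eo pipefail   # stop on any failure, including one inside the pipeline

TMP_FILE="${FINAL_FILE}.tmp"
trap 'rm -f "$TMP_FILE"; log "Backup aborted"' ERR

# tar streams uncompressed; pigz does the compression (tar -z would gzip twice)
tar --exclude='*.tmp' -cf - /data | pigz -c | age --recipients-file /data/secrets/age-recipient --output "$TMP_FILE"
mv "$TMP_FILE" "$FINAL_FILE"   # rename within one filesystem is atomic
trap - ERR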
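
The same write-to-temp-then-rename pattern can be wrapped in a small reusable function. This is a sketch rather than the code from my script: atomic_write is a hypothetical name, and it assumes the temporary file is created next to the final file so the rename stays on one filesystem.

```shell
#!/usr/bin/env bash
# atomic_write FINAL_FILE CMD [ARGS...]
# Run CMD with stdout sent to a temp file beside FINAL_FILE. Rename the temp
# file into place only if CMD succeeds; otherwise delete it and return
# CMD's exit status, so FINAL_FILE never holds a partial result.
atomic_write() {
    local final_file=$1 tmp_file status
    shift
    tmp_file=$(mktemp "${final_file}.XXXXXX") || return 1
    if "$@" > "$tmp_file"; then
        mv -f -- "$tmp_file" "$final_file"
    else
        status=$?
        rm -f -- "$tmp_file"
        return "$status"
    fi
}
```

A reader of the backup directory then sees either the previous complete archive or the new complete one, never a half-written file.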

What Went Wrong: Lessons Hard Learned

I trusted green status lights more than logs. A systemd timer’s status shows that it is enabled and active, not whether the job it triggers succeeded. Preflight checks that exit with errors don’t help if no one sees them. Key paths drift when documentation and implementation diverge. Filesystem limits like FAT32’s 4 GB per-file cap stay hidden until an archive finally grows past them.

The most dangerous failures are the ones that look healthy.
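
The check I wish I had run from day one looks at the output, not the scheduler: is there a recent, non-empty backup file on disk? Below is a minimal sketch. newest_backup_is_fresh is a hypothetical helper, the '*.age' pattern in the usage note is an assumption about file naming, and it relies on GNU find.

```shell
#!/usr/bin/env bash
# newest_backup_is_fresh DIR PATTERN MAX_AGE_MINUTES
# Succeed if DIR directly contains at least one non-empty regular file
# matching PATTERN that was modified within the last MAX_AGE_MINUTES minutes.
newest_backup_is_fresh() {
    local dir=$1 pattern=$2 max_age=$3 hit
    hit=$(find "$dir" -maxdepth 1 -type f -name "$pattern" \
              ! -empty -mmin "-$max_age" -print -quit 2>/dev/null)
    [ -n "$hit" ]
}
```

For a daily job, something like newest_backup_is_fresh /data/backups '*.age' 1500 (25 hours) run from a separate monitor would fail loudly the first day no archive appears.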


What I Actually Use

  • DGX Spark ARM64 server: Runs daily backups at 02:00 via systemd with 14-day retention
  • 256 GB Samsung USB-C stick: Formatted with ext4 for backups and exFAT for media, mounted at /mnt/sovereign-usb
  • Mistral Small 4: Encrypts backups using age with asymmetric keys stored in /data/secrets/

Flow: Backup Failure Fix (from silent failure to verified execution)

  1. Problem: Systemd timer showed green but no backups ran
  2. Diagnosis: Script path incorrect and missing dependencies
  3. Fix: Correct paths, install tools, move keys
  4. Validation: Verify script runs and logs success
  5. Prevention: Add preflight checks and alerts