Reading path
I'm operating this stack day-to-day
Day-zero is one thing. Day-N is the part nobody writes about: service lifecycle, power events, backup, network hardening, file integrity, and the mental model of the memory you are actually managing.
10 articles, in reading order
- systemd Patterns for Self-Hosted AI Services
Service lifecycle: six unit-file patterns that make a multi-service AI stack survive crashes and reboots without operator intervention.
- Power Failure Recovery on a DGX Spark: The 30-Minute Procedure
Power-event recovery: the thirty-minute procedure from cold boot to inference resumed, with the specific failure modes that show up in the first ten minutes.
- Backing Up 119B Parameters Without Going Bankrupt on Storage
Backup discipline: keeping a 119B-parameter model and the surrounding state durable without paying for object storage at model-weight scale.
- Caddy + Cloudflare Tunnel: The Reliability Pattern
Network edge: the Caddy-plus-tunnel pattern that survives ISP-side reconfiguration, with the receipts from the 2026-05 Cloudflared retirement.
- AIDE + Tripwire for AI Boxes: When File Integrity Matters
File integrity: when AIDE or Tripwire on an AI box is the right intrusion-detection layer, and the operational rules to run them without alert fatigue.
- The Unified-Memory Inference Mental Model
The resource you are actually managing: a unified memory pool shared across LLM, TTS, and image services, with the mental model that makes the sequencing rules feel obvious instead of arbitrary.
- Self-Hosted Observability for a One-Person AI Stack
Observability for a one-person stack: which Prometheus exporters earn their keep, which dashboards stay quiet on a good day, and the alert-fatigue ceiling that defines what gets watched.
- Tailscale vs Headscale for Multi-Box Sovereign Stacks
Network plane: Tailscale managed versus Headscale self-hosted, the control-plane sovereignty trade-off, and the DERP-relay decision that sits underneath both.
- The Operator's Guide to Self-Hosted Lightning
Lightning ops for the daily driver: channel selection, fee-policy updates, channel-backup automation, and the inbound-liquidity bootstrap problem.
- Gitea as Source-of-Truth for AI Pipelines
Source-of-truth for AI pipelines: why Gitea on loopback rather than GitHub, the pull-before-read ritual, and the webhook-versus-polling decision.