All articles tagged "ops": self-hosted AI fixes, setups, and architecture notes.
Backing up model weights is the wrong abstraction. Backing up the model identifier, the configuration, the customer data, and the runbook is the right one. The weights are reproducible; the data and the runbook are not.
Tailscale is the right pick if your sovereignty budget is finite and the rented coordination server is an acceptable trade. Headscale is the right pick if the coordination server's vendor risk is the dimension you cannot accept. Both ship the same WireGuard underneath.
A step-by-step runbook for getting a DGX Spark back to full production after a power event. Thirty minutes if you have rehearsed; two to six hours if you have not. The procedure assumes a UPS for graceful shutdown and a separate management host.
Prometheus plus Grafana plus one phone number plus the discipline to never alert on something that is not actionable. The observability stack that lets one operator sleep through the night and still catch the failures that matter.
Six unit-file patterns that make a multi-service AI stack survive crashes, reboots, and power events without operator intervention. The patterns are not novel; the discipline of applying them consistently is.