#ops

5 articles

All articles tagged "ops": self-hosted AI fixes, setups, and architecture notes.

Backing Up 119B Parameters Without Going Bankrupt on Storage

Backing up model weights is the wrong abstraction. Backing up the model identifier, the configuration, the customer data, and the runbook is the right one. The weights are reproducible; the data and the runbook are not.

comparison

Tailscale vs Headscale for Multi-Box Sovereign Stacks

Tailscale is the right pick if your sovereignty budget is finite and the rented coordination server is an acceptable trade. Headscale is the right pick if the coordination server's vendor risk is the dimension you cannot accept. Both ship the same WireGuard underneath.

tutorial, dgx-spark

Power Failure Recovery on a DGX Spark: The 30-Minute Procedure

A step-by-step runbook for getting a DGX Spark back to full production after a power event. Thirty minutes if you have rehearsed; two to six hours if you have not. The procedure assumes a UPS for graceful shutdown and a separate management host.

tutorial

Self-Hosted Observability for a One-Person AI Stack

Prometheus plus Grafana plus one phone number plus the discipline to never alert on something that is not actionable. The observability stack that lets one operator sleep through the night and still catch the failures that matter.

tutorial

systemd Patterns for Self-Hosted AI Services

Six unit-file patterns that make a multi-service AI stack survive crashes, reboots, and power events without operator intervention. The patterns are not novel; the discipline of applying them consistently is.