Two Days From Localhost to Production: Building a Hybrid Sovereign AI Site

April 29, 2026 7 min read

New to this stack? The Self-Hosted AI: Start Here hub article is the operational entry point: hardware tree, inference engine choice, and what hurts most after you start. It is a useful orientation for everything else on the blog.

Moving Mistral Small 4 from localhost to a production-ready site in two days hit walls no cloud guide warned me about: unified memory fragmentation, IPv6-blocked model downloads, Docker flags that silently break SGLang. The naive path of “just containerize and deploy” collapsed under 8 GB of residual RAM after a single docker kill. This is not a story about speed for speed’s sake. It is about surviving the handoff from development to a sovereign stack where every byte counts.

Quick Take. Two days from localhost to a sovereign AI site is possible on DGX Spark only if you preempt three failure modes: unified memory exhaustion during Docker restarts, IPv6-blocked Hugging Face downloads, and SGLang’s intolerance for --rm flags. The critical path is not containerization itself, but memory discipline and IPv4-only networking enforced at the OS level.

Memory discipline: the first 12 hours

Twelve hours vanished debugging why SGLang refused to restart after docker kill sglang-mistral4. The container exited cleanly. The GB10’s unified memory held the model’s weights hostage for 30 to 120 seconds. Docker’s --restart unless-stopped did not mask the delay, because memory was not released until the kernel’s page cache flushed.

The fix was not in SGLang’s flags. It was in the systemd unit’s ExecStopPost directive forcing a sync before declaring the service down. Without that, the next container launch inherited a fragmented heap and crashed with OOM. Detail and reproducer in SGLang restart OOM fix.

The lesson generalizes: on unified memory, container exit and memory release are not the same event. Treat them as two distinct steps in your service lifecycle.
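The two-step lifecycle can be sketched as a small shell helper that refuses to start the next container until enough memory is actually free. This is a minimal sketch, not the unit file from the linked fix: the function names, the 96 GiB threshold in the usage note, and the polling interval are illustrative, and it assumes `MemAvailable` in `/proc/meminfo` reflects the release on unified memory, which the article does not confirm.

```bash
#!/usr/bin/env bash
# Sketch: wait until MemAvailable reaches a threshold, or give up after a timeout.
# Function names, thresholds and intervals are illustrative assumptions.

# Print MemAvailable in GiB (rounded down) from a meminfo-style file.
mem_available_gib() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' "$meminfo"
}

# wait_for_memory <min_gib> <timeout_s> [meminfo_file] [poll_s]
# Returns 0 once enough memory is available, 1 on timeout.
wait_for_memory() {
  local min_gib="$1" timeout_s="$2" meminfo="${3:-/proc/meminfo}" poll_s="${4:-5}"
  local waited=0 avail
  while :; do
    avail="$(mem_available_gib "$meminfo")"
    if [ "${avail:-0}" -ge "$min_gib" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$poll_s"
    waited=$((waited + poll_s))
  done
}
```

A restart wrapper could call `wait_for_memory 96 120 || exit 1` after the container has exited and before launching the next one, so that "container gone" and "memory back" are checked as separate events.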

IPv4-only networking

Hugging Face’s CDN blocked IPv6 on the DGX Spark. hf download hung indefinitely. The error surfaced as a silent timeout until I ran wget -4 manually and watched 400 MB of weights stall.

The solution was not in HF’s CLI. It was at the host level, in /etc/gai.conf, forcing IPv4 preference for the entire system. CDN edge nodes that drop IPv6 traffic to ARM servers are common but rarely documented, and the DGX Spark’s network stack exposes the asymmetry immediately. Detail in the system-cleanup notes.

The lesson: dual-stack assumptions break on unusual hardware paths. When in doubt, pin IPv4 at the resolver.
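Pinning IPv4 at the resolver usually comes down to one active `precedence` line in `/etc/gai.conf`; Debian ships that line commented out in its default file. The helper below is a sketch under that assumption, with a hypothetical function name, and it only appends the line when no active copy exists, so it is safe to run repeatedly.

```bash
#!/usr/bin/env bash
# Sketch: make getaddrinfo() prefer IPv4 by activating the IPv4-mapped
# precedence rule. The function name is illustrative.

# ensure_ipv4_precedence [gai_conf_path]
ensure_ipv4_precedence() {
  local conf="${1:-/etc/gai.conf}"
  local line='precedence ::ffff:0:0/96  100'
  # If an uncommented copy of the rule already exists, leave the file alone.
  if [ -f "$conf" ] && grep -Eq '^[[:space:]]*precedence[[:space:]]+::ffff:0:0/96[[:space:]]+100' "$conf"; then
    return 0
  fi
  printf '%s\n' "$line" >> "$conf"
}
```

On a real host this needs root (for example via `sudo bash -c`), and a quick `curl -4 -I https://huggingface.co` before and after is a reasonable sanity check that IPv4 itself is healthy.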

SGLang quirks on ARM Blackwell

SGLang’s nightly build was the only version stable on GB10, but it rejected --rm flags because the CUDA context was not cleaned up in time. The required Docker run combination is --restart unless-stopped without --rm. That feels counterintuitive until you trace the CUDA driver’s cleanup sequence.

ARM v9.2-A and GB10 Blackwell do not expose the same lifecycle behavior as x86 GPUs. Generic advice from cloud forums fails. The fixed image tag for this stack is lmsysorg/sglang:nightly-dev-cu13-20260323-999bad5a, the only build that compiles for SM121A. Setup walkthrough in Mistral SGLang setup.
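The flag combination can be sketched as a command builder plus a guard. The container name, image tag, memory fraction and context length come from this article; the model path, the port, and the loopback-only publish are assumptions for illustration, and the EAGLE and shared-memory options are left out. Docker's own CLI also treats `--rm` and a restart policy as conflicting options, so the guard simply makes that rule explicit before anything runs.

```bash
#!/usr/bin/env bash
# Sketch: build the docker run command with a restart policy and no --rm.
# MODEL_PATH and the port mapping are hypothetical placeholders.
SGLANG_IMAGE="lmsysorg/sglang:nightly-dev-cu13-20260323-999bad5a"
MODEL_PATH="${MODEL_PATH:-/models/mistral-small-4}"

# Print the command, one argument per line.
sglang_docker_cmd() {
  local cmd=(
    docker run -d
    --name sglang-mistral4
    --restart unless-stopped
    --gpus all
    -p 127.0.0.1:30000:30000
    -v "${MODEL_PATH}:${MODEL_PATH}"
    "$SGLANG_IMAGE"
    python3 -m sglang.launch_server
    --model-path "$MODEL_PATH"
    --host 0.0.0.0 --port 30000
    --mem-fraction-static 0.75
    --context-length 65536
  )
  printf '%s\n' "${cmd[@]}"
}

# Read arguments on stdin; succeed only if none of them is --rm.
assert_no_rm() {
  ! grep -qx -- '--rm'
}
```

To launch for real, something like `mapfile -t cmd < <(sglang_docker_cmd); printf '%s\n' "${cmd[@]}" | assert_no_rm && "${cmd[@]}"` keeps the guard in front of the actual `docker run`.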

Sovereignty as surface area

Tailscale’s HTTPS gateway fronted the production site, and exposing HTTP ports directly on the DGX Spark was forbidden. Caddy handled TLS termination on port 443. Internal services bound to 127.0.0.1.

The mistake was opening port 80 for a local health check. Within minutes, the DGX Spark’s firewall logged probes from non-sovereign IPs. The fix was trivial: iptables -A INPUT -p tcp --dport 80 -j DROP. The lesson was structural. Sovereignty is not only about data residency. It is about surface area. Every open port is an attack vector, not a convenience. Pattern documented in the mobile terminal setup notes.
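Surface area is easy to audit mechanically. The filter below is a rough sketch with a hypothetical function name: it reads `ss -H -tln` output on stdin and prints every listening address that is not bound to loopback. It assumes the usual column layout, where the local address is the fourth field.

```bash
#!/usr/bin/env bash
# Sketch: list TCP listeners that are reachable from outside loopback.
# Expects `ss -H -tln` output on stdin (State Recv-Q Send-Q Local Peer).

non_loopback_listeners() {
  awk '$1 == "LISTEN" && $4 !~ /^(127\.|\[::1\])/ { print $4 }'
}
```

Running `ss -H -tln | non_loopback_listeners` after each deploy should print only the listeners meant to be public. Note also that `iptables -A` appends to the end of the chain, so a DROP rule added this way only takes effect if no earlier rule has already accepted the packet.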

Mistral and ComfyUI on shared memory

The final hurdle was the Mistral plus ComfyUI collision. Unified memory meant running both services simultaneously would exhaust RAM. The deployment script enforces a strict sequence: stop ComfyUI, start Mistral, restart ComfyUI only when needed.

Running both at once would have overcommitted the DGX Spark’s 128 GB ceiling and forced a hardware upgrade mid-project. Sequential GPU access is the trade-off. Coordination details in system-cleanup.
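The strict sequence can be sketched as two small functions. The unit names are hypothetical (the article does not name them), and the sketch assumes both services are managed by systemd and that Mistral is stopped again before ComfyUI returns. `DRY_RUN=1` prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Sketch: never let both GPU tenants run at the same time.
# Unit names are illustrative assumptions.
COMFY_UNIT="${COMFY_UNIT:-comfyui.service}"
MISTRAL_UNIT="${MISTRAL_UNIT:-sglang-mistral4.service}"

# With DRY_RUN=1, print the command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hand the GPU to Mistral: ComfyUI must be stopped first.
switch_to_mistral() {
  run systemctl stop "$COMFY_UNIT" || return 1
  run systemctl start "$MISTRAL_UNIT"
}

# Hand the GPU back to ComfyUI, only on explicit request.
switch_to_comfyui() {
  run systemctl stop "$MISTRAL_UNIT" || return 1
  run systemctl start "$COMFY_UNIT"
}
```

Because a stop returning is not the same as memory being released on this hardware, a real script would likely add a wait between the stop and the start.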

Two days, honestly

Two days is achievable. The path is not paved with generic container guides. It is paved with memory discipline, IPv4-only networking, and SGLang’s quirks on ARM Blackwell. The DGX Spark’s unified memory architecture rewards patience over haste. Every shortcut taken in development doubles in production.

The writing of this article took its own shortcut as well: cloud LLM as scaffold, local Mistral for draft, human polish. Sovereign by output, not by every keystroke.

Reproducibility Checklist

Mistral’s review flagged the article for missing reproducibility. Fair. Here’s the exact stack and configuration that produced this site, so you can recreate it (or audit my claims).

Hardware

Local dev: NVIDIA DGX Spark (GB10 Blackwell, ARM v9.2-A, 128 GB unified memory, 4 TB NVMe)
VPS: FlokiNET EU VPS II (affiliate link: you support sovgrid at no extra cost to you, see /support): Debian 13 Trixie, x86_64, 2 GB RAM, 50 GB Enterprise NVMe, ~€163/year, paid in bitcoin

Software versions (production)

| Component | Version |
| --- | --- |
| OS (VPS) | Debian 13.0, kernel 6.12.74+deb13+1-cloud-amd64 |
| Caddy | 2-builder + `github.com/mholt/caddy-ratelimit` plugin (xcaddy build) |
| Docker CE | 29.4.1 (official `download.docker.com` repo, not Debian’s `docker.io`) |
| Compose | v5.1.3 (`docker compose` plugin, not legacy v1) |
| Astro | 5.18.x with `@astrojs/sitemap`, `astro-robots-txt` |
| nginx (in container) | nginx:alpine, custom config |
| FastMCP | 1.x, Python 3.12, uvicorn, scikit-learn for TF-IDF |
| Inference (local) | SGLang nightly-dev-cu13-20260323, CUDA 13.0 |
| Model | Mistral Small 4 119B NVFP4 + EAGLE draft-head |

Critical config files

All committed in cipherfox/sovereign-blog and cipherfox/sovereign-grid-docs (private Gitea, mirrors available on request):

~/sovereign-blog/Caddyfile: reverse proxy + rate-limit + log routing + .well-known CORS
~/sovereign-blog/Dockerfile.caddy: xcaddy with caddy-ratelimit plugin
~/sovereign-blog/docker-compose.https.yml: blog + caddy services, volumes for caddy_data/caddy_config/logs/srv
~/sovereign-blog/nginx.conf: listen 4321 + absolute_redirect off; port_in_redirect off; server_name_in_redirect off;
~/sovereign-mcp/Dockerfile: Python 3.12-slim + uv for deps
~/sovereign-mcp/docker-compose.yml: mcp service joining external sovereign-blog_default network
/etc/ssh/sshd_config.d/99-hardening.conf: PermitRootLogin no, PasswordAuthentication no, MaxAuthTries 3, AllowUsers cipherfox
/etc/fail2ban/filter.d/caddy-mcp.conf + /etc/fail2ban/jail.d/caddy-mcp.local: 30×429/10min → 1h ban
/etc/apt/apt.conf.d/52unattended-local: auto-reboot 04:00 UTC
~/scripts/nsm-aggregate.py + ~/scripts/nsm-init.sh: daily aggregator + idempotent setup wrapper
~/sovereign-blog/srv/robots.txt: User-agent: *\nDisallow: /\n for the MCP host

One-shot bootstrap

After provisioning the VPS with Debian 13 and adding an SSH public key:

# 1. Hardening (run once as root via sudo on the VPS)
sudo bash ~/scripts/nsm-init.sh   # chmod logs, install cron, install user-crontab
# Plus: install ufw + fail2ban via apt, deploy sshd_config.d/99-hardening.conf

# 2. Docker official repo + Compose plugin
sudo install -m 0755 -d /etc/apt/keyrings   # make sure the keyring directory exists
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian trixie stable" | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt update && sudo apt install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
sudo usermod -aG docker "$USER"   # log out and back in before running docker without sudo

# 3. Caddy custom build with ratelimit plugin
cd ~/sovereign-blog
docker compose -f docker-compose.https.yml up -d --build

# 4. MCP container
cd ~/sovereign-mcp
docker compose up -d --build

# 5. Verify
curl -I https://sovgrid.org/
curl -s https://mcp.sovgrid.org/health

Benchmark numbers (own measurements)

Mistral Small 4 119B NVFP4 + EAGLE on GB10: ~41 tok/s output (single-stream, EAGLE accept rate 2.5-3.4)
Same model without EAGLE: 12-15 tok/s
Context length: 65 536 tokens
Memory utilization: 75 % static (--mem-fraction-static 0.75)
PageSpeed Insights mobile after font-subsetting: 96, desktop: 100
Let’s Encrypt certificate acquisition via Caddy: 6 seconds (HTTP-01 challenge)
Initial HTML page weight (gzipped): 12 KB
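The EAGLE gain implied by the two throughput figures above is simple arithmetic, sketched here with a hypothetical helper. It divides the ~41 tok/s result by each end of the 12-15 tok/s baseline range.

```bash
#!/usr/bin/env bash
# Sketch: speedup = throughput with EAGLE / throughput without EAGLE.

# speedup <with_eagle_tok_s> <baseline_tok_s>
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 41 15   # prints 2.73 (against the fastest baseline)
speedup 41 12   # prints 3.42 (against the slowest baseline)
```

The resulting range of roughly 2.7x to 3.4x sits close to the 2.5-3.4 acceptance figure reported alongside it, which is at least consistent with speculative decoding accounting for most of the gain.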

Failure modes recreated

The five fixes referenced above are documented as standalone articles in /blog/:

fixes-sglang-restart-oom-fix: ExecStopPost=/bin/sync + 60s wait before restart
fixes-system-cleanup: /etc/gai.conf IPv4 preference for HF downloads
fixes-cloudflared-astro-migration-2026-04-04: port 4321 → Caddy reverse-proxy migration
fixes-vibe-write-file-overwrite: race condition in Vibe’s edit pipeline
fixes-sglang-vibe-performance-benchmark: empirical EAGLE accept-rate measurements

Each article includes the exact failing command output and the fix applied. Where a fix was a one-line systemd directive, that line is in the article verbatim. Where a fix was a sequence (stop service → wait for cleanup → restart), the script lives at the path referenced.
