Two Days From Localhost to Production: Building a Hybrid Sovereign AI Site
Moving Mistral Small 4 from localhost to a production-ready site in two days hit walls no cloud guide warned me about: unified memory fragmentation, IPv6-blocked model downloads, Docker flags that silently break SGLang. The naive path of “just containerize and deploy” collapsed under 8 GB of residual RAM after a single docker kill. This is not a story about speed for speed’s sake. It is about surviving the handoff from development to a sovereign stack where every byte counts.
Quick Take. Two days from localhost to a sovereign AI site is possible on DGX Spark only if you preempt three failure modes: unified memory exhaustion during Docker restarts, IPv6-blocked Hugging Face downloads, and SGLang’s intolerance for `--rm` flags. The critical path is not containerization itself, but memory discipline and IPv4-only networking enforced at the OS level.
Memory discipline: the first 12 hours
Twelve hours vanished debugging why SGLang refused to restart after docker kill sglang-mistral4. The container exited cleanly. The GB10’s unified memory held the model’s weights hostage for 30 to 120 seconds. Docker’s --restart unless-stopped did not mask the delay, because memory was not released until the kernel’s page cache flushed.
The fix was not in SGLang’s flags. It was in the systemd unit’s ExecStopPost directive forcing a sync before declaring the service down. Without that, the next container launch inherited a fragmented heap and crashed with OOM. Detail and reproducer in SGLang restart OOM fix.
The lesson generalizes: on unified memory, container exit and memory release are not the same event. Treat them as two distinct steps in your service lifecycle.
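One way to make the second step explicit is to poll for available memory instead of relying on a fixed sleep. The sketch below is a variant of the sync-and-wait fix, not the unit shipped in the referenced article. It assumes that MemAvailable in /proc/meminfo reflects the reclaimed unified memory on this platform; the function names, threshold, and timeout are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: treat "container exited" and "memory released" as separate steps.
# Function names, threshold and timeout are illustrative assumptions.

# Print MemAvailable (in kB) from a meminfo-style file.
mem_available_kb() {
  awk '/^MemAvailable:/ {print $2}' "${1:-/proc/meminfo}"
}

# wait_for_memory MIN_KB [TIMEOUT_S] [MEMINFO_FILE]
# Returns 0 once MemAvailable >= MIN_KB, 1 if the timeout expires first.
wait_for_memory() {
  local min_kb=$1 timeout=${2:-120} file=${3:-/proc/meminfo}
  local waited=0 avail
  while :; do
    avail=$(mem_available_kb "$file")
    if [ "${avail:-0}" -ge "$min_kb" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

A restart wrapper could then run `sync && wait_for_memory "$MIN_KB" 120 && docker start sglang-mistral4`, where MIN_KB is whatever headroom the model needs on your machine. The 120-second default mirrors the upper end of the delay observed above.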
IPv4-only networking
Hugging Face’s CDN blocked IPv6 on the DGX Spark. hf download hung indefinitely. The error surfaced as a silent timeout until I ran wget -4 manually and watched 400 MB of weights stall.
The solution was not in HF’s CLI. It was at the host level, in /etc/gai.conf, forcing IPv4 preference for the entire system. CDN edge nodes that drop IPv6 traffic to ARM servers are common but rarely documented, and the DGX Spark’s network stack exposes the asymmetry immediately. Detail in the system-cleanup notes.
The lesson: dual-stack assumptions break on unusual hardware paths. When in doubt, pin IPv4 at the resolver.
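A minimal sketch of pinning IPv4 at the resolver follows. It assumes the commonly used glibc rule `precedence ::ffff:0:0/96  100`, which stock gai.conf files ship commented out; the exact line used on this host is in the referenced notes, not reproduced here. The function name is hypothetical, and the rule is appended only if no active copy exists.

```bash
#!/usr/bin/env bash
# Sketch: idempotently prefer IPv4 in getaddrinfo() results.
# Assumption: the standard glibc gai.conf precedence rule is what is wanted.

# prefer_ipv4 [FILE]: append the IPv4-mapped precedence rule if missing.
prefer_ipv4() {
  local file=${1:-/etc/gai.conf}
  local rule='precedence ::ffff:0:0/96  100'
  # An uncommented rule for the IPv4-mapped prefix means nothing to do.
  if grep -Eq '^[[:space:]]*precedence[[:space:]]+::ffff:0:0/96[[:space:]]+100' "$file" 2>/dev/null; then
    return 0
  fi
  printf '%s\n' "$rule" >> "$file"
}
```

Writing to /etc/gai.conf needs root, so on a real host the function would run from a root shell. Running it twice leaves a single active rule, which keeps it safe inside a setup script.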
SGLang quirks on ARM Blackwell
SGLang’s nightly build was the only version stable on GB10, but it rejected --rm flags because the CUDA context was not cleaned up in time. The required Docker run combination is --restart unless-stopped without --rm. That feels counterintuitive until you trace the CUDA driver’s cleanup sequence.
ARM v9.2-A and GB10 Blackwell do not expose the same lifecycle behavior as x86 GPUs. Generic advice from cloud forums fails. The fixed image tag for this stack is lmsysorg/sglang:nightly-dev-cu13-20260323-999bad5a, the only build that compiles for SM121A. Setup walkthrough in Mistral SGLang setup.
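The flag combination can be enforced in the launch script rather than remembered. The sketch below only prints the docker run command; the image tag and the container name sglang-mistral4 come from the text, while `--gpus all`, the function names, and the omission of the server arguments after the image are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: build the docker run line and refuse --rm outright.
# Image tag and container name are from the article; the rest is assumed.
SGLANG_IMAGE='lmsysorg/sglang:nightly-dev-cu13-20260323-999bad5a'

# has_rm_flag ARGS...: succeed if --rm appears among the given arguments.
has_rm_flag() {
  local arg
  for arg in "$@"; do
    [ "$arg" = "--rm" ] && return 0
  done
  return 1
}

# sglang_run_cmd EXTRA_DOCKER_ARGS...: print the docker run command,
# failing on any argument list that contains --rm.
sglang_run_cmd() {
  if has_rm_flag "$@"; then
    echo "refusing --rm: use --restart unless-stopped instead" >&2
    return 1
  fi
  echo docker run -d --name sglang-mistral4 --gpus all \
    --restart unless-stopped "$@" "$SGLANG_IMAGE"
}
```

Printing instead of executing keeps the guard easy to review; the real launch would append the SGLang server command and its arguments after the image name.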
Sovereignty as surface area
Tailscale’s HTTPS gateway carried the production site’s sovereignty, but exposing HTTP ports directly on the DGX Spark was forbidden. Caddy handled TLS termination at port 443. Internal services bound to 127.0.0.1.
The mistake was opening port 80 for a local health check. Within minutes, the DGX Spark’s firewall logged probes from non-sovereign IPs. The fix was trivial: iptables -A INPUT -p tcp --dport 80 -j DROP. The lesson was structural. Sovereignty is not only about data residency. It is about surface area. Every open port is an attack vector, not a convenience. Pattern documented in the mobile terminal setup notes.
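Surface area is easier to keep small when it is audited mechanically. The following sketch filters the output of `ss -H -tln` and prints every TCP listener that is not bound to loopback. The function name is hypothetical, and the filter assumes the local address is the fourth column, which is how ss prints listening sockets without a header.

```bash
#!/usr/bin/env bash
# Sketch: list TCP listeners that are reachable from outside loopback.

# non_loopback_listeners: read `ss -H -tln` output on stdin and print each
# local address that is not bound to 127.0.0.0/8 or [::1].
non_loopback_listeners() {
  awk '$4 !~ /^(127\.|\[::1\])/ { print $4 }'
}
```

Run it as `ss -H -tln | non_loopback_listeners`. On the setup described here, anything in the output beyond the TLS listener on port 443 would deserve a second look.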
Mistral and ComfyUI on shared memory
The final hurdle was the Mistral plus ComfyUI collision. Unified memory meant running both services simultaneously would exhaust RAM. The deployment script enforces a strict sequence: stop ComfyUI, start Mistral, restart ComfyUI only when needed.
Over-provisioning RAM would have violated the DGX Spark’s 128 GB ceiling and forced a hardware upgrade mid-project. Sequential GPU access is the trade. Coordination details in system-cleanup.
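The sequencing itself is small enough to sketch. The version below is not the deployment script from the notes; the ComfyUI container name is a hypothetical placeholder, and the DOCKER variable exists only so the sequence can be dry-run with `DOCKER="echo docker"`.

```bash
#!/usr/bin/env bash
# Sketch: hand the GPU from ComfyUI to Mistral, one service at a time.
DOCKER=${DOCKER:-docker}              # set to "echo docker" for a dry run
COMFY=${COMFY:-comfyui}               # hypothetical container name
MISTRAL=${MISTRAL:-sglang-mistral4}   # container name used in the article

# switch_to_mistral: stop ComfyUI first and start Mistral only if that worked.
switch_to_mistral() {
  $DOCKER stop "$COMFY" || return 1
  # A real script would wait here until the memory is actually released.
  $DOCKER start "$MISTRAL"
}
```

The early return matters: if ComfyUI fails to stop, starting Mistral anyway would recreate the collision the sequence exists to avoid.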
Two days, honestly
Two days is achievable. The path is not paved with generic container guides. It is paved with memory discipline, IPv4-only networking, and SGLang’s quirks on ARM Blackwell. The DGX Spark’s unified memory architecture rewards patience over haste. Every shortcut taken in development doubles in production.
The writing of this article took its own shortcut as well: cloud LLM as scaffold, local Mistral for draft, human polish. Sovereign by output, not by every keystroke.
Reproducibility Checklist
Mistral’s review flagged the article for missing reproducibility. Fair. Here’s the exact stack and configuration that produced this site, so you can recreate it (or audit my claims).
Hardware
- Local dev: NVIDIA DGX Spark (GB10 Blackwell, ARM v9.2-A, 128 GB unified memory, 4 TB NVMe)
- VPS: FlokiNET Romania VPS II: Debian 13 Trixie, x86_64, 2 GB RAM, 50 GB Enterprise NVMe, ~€163/year, paid in bitcoin
Software versions (production)
| Component | Version |
|---|---|
| OS (VPS) | Debian 13.0, kernel 6.12.74+deb13+1-cloud-amd64 |
| Caddy | 2-builder + github.com/mholt/caddy-ratelimit plugin (xcaddy build) |
| Docker CE | 29.4.1 (official download.docker.com repo, not Debian’s docker.io) |
| Compose | v5.1.3 (docker compose plugin, not legacy v1) |
| Astro | 5.18.x with @astrojs/sitemap, astro-robots-txt |
| nginx (in container) | nginx:alpine, custom config |
| FastMCP | 1.x, Python 3.12, uvicorn, scikit-learn for TF-IDF |
| Inference (local) | SGLang nightly-dev-cu13-20260323, CUDA 13.0 |
| Model | Mistral Small 4 119B NVFP4 + EAGLE draft-head |
Critical config files
All committed in cipherfox/sovereign-blog and cipherfox/sovereign-grid-docs (private Gitea, mirrors available on request):
- `~/sovereign-blog/Caddyfile`: reverse proxy + rate-limit + log routing + `.well-known` CORS
- `~/sovereign-blog/Dockerfile.caddy`: xcaddy with caddy-ratelimit plugin
- `~/sovereign-blog/docker-compose.https.yml`: blog + caddy services, volumes for caddy_data/caddy_config/logs/srv
- `~/sovereign-blog/nginx.conf`: listen 4321 + `absolute_redirect off; port_in_redirect off; server_name_in_redirect off;`
- `~/sovereign-mcp/Dockerfile`: Python 3.12-slim + uv for deps
- `~/sovereign-mcp/docker-compose.yml`: mcp service joining external `sovereign-blog_default` network
- `/etc/ssh/sshd_config.d/99-hardening.conf`: `PermitRootLogin no`, `PasswordAuthentication no`, `MaxAuthTries 3`, `AllowUsers cipherfox`
- `/etc/fail2ban/filter.d/caddy-mcp.conf` + `/etc/fail2ban/jail.d/caddy-mcp.local`: 30×429/10min → 1h ban
- `/etc/apt/apt.conf.d/52unattended-local`: auto-reboot 04:00 UTC
- `~/scripts/nsm-aggregate.py` + `~/scripts/nsm-init.sh`: daily aggregator + idempotent setup wrapper
- `~/sovereign-blog/srv/robots.txt`: `User-agent: *\nDisallow: /\n` for the MCP host
One-shot bootstrap
After provisioning the VPS with Debian 13 and adding an SSH public key:
# 1. Hardening (run once as root via sudo on the VPS)
sudo bash ~/scripts/nsm-init.sh # chmod logs, install cron, install user-crontab
# Plus: install ufw + fail2ban via apt, deploy sshd_config.d/99-hardening.conf
# 2. Docker official repo + Compose v2
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian trixie stable" | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt update && sudo apt install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
sudo usermod -aG docker $USER
# 3. Caddy custom build with ratelimit plugin
cd ~/sovereign-blog
docker compose -f docker-compose.https.yml up -d --build
# 4. MCP container
cd ~/sovereign-mcp
docker compose up -d --build
# 5. Verify
curl -I https://sovgrid.org/
curl -s https://mcp.sovgrid.org/health
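Right after `docker compose up -d`, the services may need a few seconds before the verify step succeeds, especially while Caddy is still acquiring its certificate. A small retry helper makes that step less flaky. This is a generic sketch, not part of the original scripts, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: retry a command a fixed number of times with a pause in between.

# retry ATTEMPTS DELAY_S CMD...: run CMD until it succeeds or attempts run out.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}
```

For example, `retry 10 3 curl -fsS https://mcp.sovgrid.org/health` tries the health endpoint up to ten times, three seconds apart; the `-f` flag makes curl return a non-zero status on HTTP errors so that a failing endpoint actually triggers the next attempt.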
Benchmark numbers (own measurements)
- Mistral Small 4 119B NVFP4 + EAGLE on GB10: ~41 tok/s output (single-stream, EAGLE accept rate 2.5–3.4)
- Same model without EAGLE: 12–15 tok/s
- Context length: 65 536 tokens
- Memory utilization: 75 % static (`--mem-fraction-static 0.75`)
- PageSpeed Insights mobile after font-subsetting: 96, desktop: 100
- Caddy + Let’s Encrypt certificate acquisition: 6 seconds (HTTP-01 challenge)
- Initial HTML page weight (gzipped): 12 KB
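For readers who want the EAGLE gain as a ratio, the two throughput figures above can be divided directly. The helper below is plain arithmetic on the numbers in this list, with a hypothetical function name; it adds no new measurement.

```bash
#!/usr/bin/env bash
# Sketch: speedup = throughput with EAGLE / throughput without EAGLE.

# speedup WITH WITHOUT: print the ratio rounded to one decimal place.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

speedup 41 15   # against the faster baseline, prints 2.7
speedup 41 12   # against the slower baseline, prints 3.4
```

So the draft head buys roughly a 2.7× to 3.4× improvement in single-stream output speed on this stack.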
Failure modes recreated
The five fixes referenced above are documented as standalone articles in /blog/:
- `fixes-sglang-restart-oom-fix`: `ExecStopPost=/bin/sync` + 60s wait before restart
- `fixes-system-cleanup`: `/etc/gai.conf` IPv4 preference for HF downloads
- `fixes-cloudflared-astro-migration-2026-04-04`: port 4321 → Caddy reverse-proxy migration
- `fixes-vibe-write-file-overwrite`: race condition in Vibe’s edit pipeline
- `fixes-sglang-vibe-performance-benchmark`: empirical EAGLE accept-rate measurements
Each article includes the exact failing command output and the fix applied. Where a fix was a one-line systemd directive, that line is in the article verbatim. Where a fix was a sequence (stop service → wait for cleanup → restart), the script lives at the path referenced.