The Sovereign AI Stack in 2026: A Reference Architecture
This is the stack that runs sovgrid.org and its consulting practice, end to end. It is honest about which components are owned, which are rented, and what would change for someone in a different situation. If you are scoping a sovereign-AI project for your own team, this is the article that gives you the bill of materials and the decision tree behind each line item.
The stack is described in twelve sections, organized by layer. Each section ends with the alternative I considered, the alternative I would recommend for a different buyer profile, and a cross-link to the engineering postmortem where the decision was load-tested. The last three sections cover paths: how to read the rest of the blog, how to connect your agent via MCP, and how to engage me directly.
This is the v2 refresh dated 2026-05-25. The v1 was written 2026-05-20 against the older state of the stack. Four changes carry this revision: the May 2026 model-stack migration to Qwen 3.6 PrismaQuant as the primary, the retirement of the Cloudflared tunnel in favor of direct Caddy + Let’s Encrypt on FlokiNET, the Astro 5 to 6 upgrade, and the switch.sh mutex pattern.
Quick Take
- Stack shape: one DGX Spark, two production LLMs (Qwen 3.6 PrismaQuant primary at 57 to 62 tok/s with DFlash, Mistral Small 4 NVFP4 as the safer-eagle fallback at 36.5 tok/s for vision and German prose), one switch.sh mutex enforcing memory exclusivity, deployed once and refactored fifteen times.
- The stack is sovereign on six of six dimensions (custody, control plane, supply chain, identity, revenue path, network ingress) as of the May 2026 Cloudflared retirement. The trade-off of accepting more operational responsibility for DDoS hardening is named and accepted.
- Build cost (2026): approximately €4,800 for the Spark (post-February-2026 supply-chain hike) and €1,400 for the surrounding ancillary equipment. Software is overwhelmingly open-source and self-hosted.
- Operating posture: static-first publishing on Astro 6, headless inference, mesh networking via Tailscale, observability via Prometheus and the dashboard at services-sovereign-dashboard, payments via Lightning + bank transfer.
- Comparison anchor: the same workload on a cloud-API stack would cost an order of magnitude more per call at the volumes I operate, while removing the customer-facing sovereignty story that is the actual product. (For the cost-model breakdown, Self-Hosted AI vs Cloud APIs: Real Total Cost walks the numbers.)
Section 1: Hardware base
The DGX Spark is the foundation. One unit, single-box, in the office.
- One NVIDIA DGX Spark Founders Edition (GB10, 128 GB unified memory, NVMe SSD; Founders MSRP raised to $4,699 in February 2026 due to memory supply constraints)
- One UPS for graceful shutdown on power events
- One small NAS for backup destination (separate physical box)
- One mini-PC running Debian as the management plane
The Spark is the right hardware for this stack because the workload is mixture-of-experts language models in the 35B-total / 3B-active range (Qwen 3.6) plus the 119B dense Mistral Small 4 as a kept-in-reserve fallback. Hardware specifications are documented at the NVIDIA DGX Spark product page and the DGX Spark Hardware Overview (GB10 Grace Blackwell superchip, 128 GB LPDDR5X unified memory, 20 Arm cores). For the pre-purchase decision tree itself, see Should You Buy a DGX Spark in 2026, the literal scoping article. For the reasoning behind the hardware pick versus the four real alternatives (Mac Studio M4 Max at 128 GB unified, Mac Studio M3 Ultra at 96 to 512 GB unified, dual RTX 3090 build, Strix Halo mini-PC), see DGX Spark vs M3 Ultra Mac Studio: Local LLM for the long-form comparison. For the war-stories on the same hardware, see Five DGX Spark Disasters I Survived.
The mini-PC is the dimension most operators skip. It runs Tailscale, Prometheus, the alerting stack, the backup orchestrator, and the watchdog scripts. Its job is to remain online when the Spark is restarting, to record what the Spark did before the crash, and to serve as the operator’s gateway to the system. Cost: €350 used. Value: very high.
The October 2026 cliff: Apple’s M5 Ultra Mac Studio is expected to ship in late 2026 (delayed by global memory chip shortages). The M3 Ultra remains the current top Apple SKU until then. The practical advice for buyers in May 2026 is binary: either commit now or wait the four-to-six months. The Spark is not on the same refresh cadence; the next-generation Blackwell-class workstation has no public roadmap as of this writing.
Alternative for a different buyer: if you do not need MoE-class language models, the dual RTX 3090 build at roughly €2,100 is the better value for dense LLM plus diffusion plus general lab work. If you need macOS ergonomics or the 512 GB unified memory ceiling, the Mac Studio M3 Ultra is the right answer. For a budget-tier breakdown across price points, see the four-article series: 2k beginner, 4k mid-tier, 8k premium, and 15k pro-studio.
Section 2: Operating system and management plane
Ubuntu LTS on the Spark, Debian on the mini-PC, no graphical desktop running on either by default.
- Ubuntu 24.04 LTS on the Spark, kernel pinned to the NVIDIA-supplied version for Blackwell compatibility
- Debian 13 on the mini-PC management host
- systemd as the service manager, with the patterns documented in Systemd Patterns for Self-Hosted AI Services
- AIDE for file-integrity monitoring on the Spark, per AIDE and Tripwire for AI Boxes: File Integrity
- A switch.sh mutex at /data/scripts/llm/switch.sh that flips between Qwen on vLLM and Mistral on SGLang, enforcing that at most one inference engine has loaded weights into unified memory at a time
The headless decision is operational, not aesthetic. The desktop session on the Spark is fragile when the inference backend hits an edge case. (See Fixes: vLLM MoE Throughput sm121 Desktop Freeze for the worked example: the default FlashInfer-MoE backend would freeze the desktop while inference continued, requiring an SSH reboot from another machine. The fix is VLLM_FLASHINFER_MOE_BACKEND=latency.) Running headless removes the failure mode entirely.
systemd is the service manager because every long-running component on this stack is wrapped as a unit file. The pattern is: one unit per logical service, restart policies tuned to the failure mode, after-dependencies declared explicitly, journal output piped to the dashboard. The vllm-qwen36.service unit exists but is deliberately not enabled at boot; mutual exclusion with Mistral is an operator job through switch.sh, not a systemd default, because picking the wrong default at boot would either lock the operator into Qwen for vision work that needs Mistral or have both services race for unified memory at startup.
The page-cache hijack pattern is the second operational receipt worth knowing on Spark specifically: after a vLLM or SGLang crash, the kernel page cache holds stale model weights, and a relaunch without echo 3 > /proc/sys/vm/drop_caches produces an OOM at roughly 95 GB usage. One shell command before every engine relaunch keeps this from biting in production.
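The relaunch step described above can be sketched in Python. This is a minimal illustration, not the site's actual script: the function names are hypothetical, the unit name is taken from the article, and the real host needs root for the write to succeed. The target path is a parameter so the logic can be exercised against a scratch file.

```python
import os
import subprocess
from pathlib import Path

DROP_CACHES = Path("/proc/sys/vm/drop_caches")


def drop_page_cache(target: Path = DROP_CACHES) -> None:
    """Flush dirty pages, then ask the kernel to drop page cache, dentries and inodes."""
    os.sync()                 # write dirty pages out first so they can be released
    target.write_text("3\n")  # same effect as: echo 3 > /proc/sys/vm/drop_caches


def relaunch(unit: str, target: Path = DROP_CACHES, run=subprocess.run) -> None:
    """Drop caches, then start the engine unit (hypothetical helper)."""
    drop_page_cache(target)
    run(["systemctl", "start", unit], check=True)
```

Calling `relaunch("vllm-qwen36.service")` as root performs the drop and the start in the order the paragraph requires; the cache drop always comes first.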
Section 3: Networking and ingress
Tailscale for the operator mesh, Caddy as the reverse proxy on both the local Spark and the public-facing Floki VPS, direct Let’s Encrypt certificates instead of a Cloudflare tunnel as of May 2026.
- Tailscale for the operator mesh (self-hosted alternative: Headscale; see the forthcoming companion tailscale-vs-headscale-multi-box-sovereign)
- Caddy 2 with Let’s Encrypt ACME issuance, no Cloudflare DNS plugin required since the Cloudflared retirement
- Direct ingress to the Floki VPS that fronts the static site and the MCP server at sovgrid.org and mcp.sovgrid.org
- A Tor hidden service for the censored-network audience; see Tor Hidden Service for Sovereign AI: When and How
The Cloudflared retirement is the May 2026 change worth flagging in this section. The previous architecture used a Cloudflare Tunnel for inbound traffic, which absorbed DDoS-class abuse at the edge but introduced a rented dimension that conflicted with the broader sovereignty posture. The migration replaced the tunnel with direct Caddy + Let’s Encrypt on the Floki VPS (Romania-hosted, FlokiNET infrastructure), which restores end-to-end ownership of the network path at the cost of accepting more operational responsibility for DDoS hardening.
The trade-off is named honestly. A serious DDoS against sovgrid.org now requires either rate-limiting at Caddy, IP-blocklisting at the VPS firewall, or scaling out to a second VPS. The Cloudflare Tunnel handled this class of abuse transparently. The motivation for the retirement was that the threat model for a one-person engineering blog is not a state-actor DDoS; it is the occasional vuln-scanner that Caddy’s edge-block pattern (see the floki/Caddyfile in the repo) handles cleanly. The retirement is a sovereignty win, not a security win, and the framing matters.
Tailscale is still rented for similar reasons. The mesh works out of the box, the key custody is acceptable for the threat model, and the operational overhead of running Headscale is real. I have rehearsed the migration path to Headscale for the case where Tailscale’s terms change in a way I do not accept, but I have not yet executed it. (See Caddy Cloudflare Tunnel Reliability Pattern for the historical version of this pattern.)
Section 4: Inference layer
vLLM serving Qwen 3.6 PrismaQuant 4.75bit as the production primary, SGLang serving Mistral Small 4 NVFP4 as the safer-eagle fallback at 36.5 tok/s for vision and creative-writing workloads, with the switch.sh mutex enforcing exclusivity.
- vLLM 0.20+ with VLLM_FLASHINFER_MOE_BACKEND=latency set in the environment (the default is wrong for this hardware; see Fixes: vLLM MoE Throughput sm121 Desktop Freeze)
- Qwen 3.6 PrismaQuant 4.75bit (Alibaba, Apache 2.0): 57 to 62 tok/s sustained interactive decode on a single Spark under DFlash speculative decoding; see Spark Arena Rank 4 Made Me Add Qwen3.6 for the original 45 tok/s baseline measurement and Mistral vs Qwen 3.6: The Zero That Was a Broken Ruler for the verified vision-asymmetry
- SGLang as the secondary backend for the Mistral path; see Setup: Mistral SGLang Setup and the safer-eagle configuration at 36.5 tok/s decode
- OpenClaw side-car proxy in front of Mistral to patch the alternating-roles BadRequestError; see Fixes: OpenClaw Mistral Alternating Roles
- switch.sh qwen|mistral|none|status as the mutex (Termux-friendly), plus a Watchtower disable-label on vllm-qwen36 and sglang-mistral4 that stopped a 385-restart cycle in May 2026
- The model-stack-level comparison is Mistral Small 4 vs Qwen 3.6 vs GLM-5: DGX Spark
The two-model decision is workload-driven. Qwen 3.6 is the right primary for code, agent tools, and structured-output workloads where 57 to 62 tok/s and 97 percent ToolCall-15 accuracy matter. Mistral is the right secondary for vision-reading and creative-writing tasks where the NVFP4 quant preserves the Pixtral-lineage vision tower (which the PrismaQuant 4.75bit Qwen quant drops) and the prose quality on German has not yet been beaten by an open competitor.
The mutex pattern inverts the conventional “one model serves all calls” in favor of “two models on disk, one hot, mutex enforced.” The reason is unified-memory contention: hot-loading both Qwen at 22 GB and Mistral at 60 GB simultaneously creates a memory cascade that pulls the desktop session down. The switch.sh script handles the systemctl start/stop pair, the Watchtower disable-label that prevents the auto-update loop, and a status check that confirms which model is currently hot.
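The ordering rule behind the mutex (stop everything that is not the target, then start the target) can be sketched as follows. This is a Python illustration of the ordering only, not the contents of switch.sh. Only vllm-qwen36.service is named in the article; the Mistral unit name and the function name are assumptions.

```python
# Hypothetical mapping from switch.sh targets to systemd units. Only
# vllm-qwen36.service is named in the article; the Mistral unit name is assumed.
UNITS = {
    "qwen": "vllm-qwen36.service",
    "mistral": "sglang-mistral4.service",  # assumption
}


def plan_switch(target: str) -> list[list[str]]:
    """Return the systemctl calls for `target`, stopping before starting."""
    if target not in (*UNITS, "none"):
        raise ValueError(f"unknown target: {target!r}")
    # Stop every engine that is not the target first, so two sets of weights
    # are never resident in unified memory at the same time.
    plan = [["systemctl", "stop", unit] for name, unit in UNITS.items() if name != target]
    if target != "none":
        plan.append(["systemctl", "start", UNITS[target]])
    return plan
```

Returning a plan instead of executing it keeps the ordering testable without touching systemd; a caller would pass each entry to `subprocess.run(..., check=True)`.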
For a buyer with a different workload mix, the answer changes. A code-only practice can drop Mistral and run Qwen alone, freeing the unified-memory budget for a co-resident image-generation pipeline. A creative-writing practice can flip the assignment. A vision-heavy practice will keep Mistral as primary and Qwen as secondary.
Section 5: Quantization and precision
PrismaQuant 4.75bit for Qwen, NVFP4 for Mistral, with the architectural reasoning recorded explicitly.
The right quantization for a model is not a property of the model; it is a property of the (model, workload, hardware) triple. NVFP4 is the right choice for Mistral on the Spark because the vision tower survives quantization, which matters for image-reading workloads. PrismaQuant 4.75bit is the right choice for Qwen 3.6 because it produces the highest measured single-Spark throughput on the public Spark Arena leaderboard, at the cost of dropping the vision tower from the local quant. (See Mistral vs Qwen 3.6: The Zero That Was a Broken Ruler for the verified-the-hard-way version of this finding, with the HTTP 200 round-trip on a real screenshot as the load-bearing evidence.)
For the quantization mental model in general, see NVFP4 Quantization Explained. The short version: quantization is lossy compression for model weights, the loss is bounded if you know what you are doing, and the bound is workload-dependent. NVFP4 is one of three quantization formats with serious Spark support (along with INT4 and MXFP4); the right pick depends on what you need to preserve from the unquantized baseline.
The corollary for buyers: do not trust the parameter count as a capability indicator. A 754B-class model like GLM-5.1 at AWQ INT4 is 377 GB on disk, roughly three times the Spark’s 128 GB unified memory budget. The right question is not “which model is largest” but “which model is largest and still fits the hardware envelope I have decided to operate.”
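The 377 GB figure follows from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a weights-only estimate under that assumption. It ignores quantization scales, KV cache, and runtime overhead, so real footprints sit somewhat above it, in line with the roughly 22 GB and 60 GB quoted for Qwen and Mistral in Section 4.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal GB: parameters times bits, divided by 8."""
    return params_billion * bits_per_weight / 8


SPARK_UNIFIED_GB = 128

for name, params, bits in [
    ("754B-class at INT4", 754, 4),
    ("119B at 4-bit", 119, 4),
    ("35B at 4.75-bit", 35, 4.75),
]:
    gb = weight_gb(params, bits)
    print(f"{name}: {gb:.1f} GB, fits in {SPARK_UNIFIED_GB} GB: {gb < SPARK_UNIFIED_GB}")
```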
Section 6: Speculative decoding
DFlash on Qwen for 57 to 62 tok/s, EAGLE on Mistral in the safer-eagle configuration at 36.5 tok/s with a no-EAGLE rollback kept ready, MTP n=3 stable on Qwen.
Speculative decoding sounds like free throughput. It is not free, and on some workloads it is net-negative. EAGLE’s draft distribution is structured-output-hostile; on JSON-emitting workloads, EAGLE drops throughput rather than raising it. (See Fixes: EAGLE Content-Dependent Throughput for the worked failure mode.) For the long version of “when does this technique help and when does it hurt,” the forthcoming companion eagle-speculative-decoding-when-helps-when-doesnt walks the decision tree.
The May 2026 state on Mistral: EAGLE on the current SGLang nightly was confirmed stable in the 2026-05-22 switch.sh cleanup session, measuring 36.5 tok/s versus the no-EAGLE baseline of 29 tok/s. Mistral is therefore running the safer-eagle configuration in production. The rollback path to sglang-mistral4-safer.sh (no-EAGLE) is available if the regression resurfaces and is tracked as a documented Gitea contingency.
DFlash on Qwen is stable and is the configuration that produces the 57 to 62 tok/s number. MTP n=3 is stable; n=4 regresses. The dispatcher in front of the inference backends knows the workload class and can disable speculation per request when the workload calls for it. This is the kind of detail that does not appear in vendor documentation and only appears in an operational runbook after the operator has been burned once.
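The per-request decision can be illustrated with a small policy function. This is a hypothetical sketch: the workload classes, the method labels, and the function name are assumptions, and how the decision is then applied to vLLM or SGLang is deliberately left out. It encodes only the one rule the text states, that EAGLE is net-negative on structured output.

```python
from enum import Enum


class Workload(Enum):
    CHAT = "chat"
    CODE = "code"
    STRUCTURED = "structured"  # JSON or other schema-constrained output


def use_speculation(method: str, workload: Workload) -> bool:
    """Decide whether to keep speculative decoding on for one request."""
    # EAGLE is reported above as net-negative on JSON-emitting workloads.
    if method == "eagle" and workload is Workload.STRUCTURED:
        return False
    return True
```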
Section 7: Image generation
FLUX.1-schnell on ComfyUI is the current production image-gen pipeline. The blog hero images for every article on this site come out of it; the speed (sub-second per image on the Spark, ~5 minutes for a 32-article hero batch) matches the publishing cadence.
- FLUX.1-schnell as the production text-to-image model
- ComfyUI as the orchestration layer; see Setup: ComfyUI FLUX Setup
- Sequential with the LLM stack: image-gen and inference share unified memory, switched via the switch.sh mutex
The Qwen-Image-2512 model (a competing open-weights text-to-image model with materially better text-rendering quality, ELO 1161 on the artificialanalysis.ai leaderboard versus FLUX.1-schnell’s lower position) is a candidate for evaluation when the workload needs text-in-image generation: episode-cover art with titles, infographics, or labeled diagrams. For the current workload of photorealistic abstract motifs without text overlay, FLUX.1-schnell is the right tool by speed and quality both. Nothing has been downloaded, benchmarked, or rotated into production yet.
The “co-resident” decision is the operational payoff of the PrismaQuant quantization choice from Section 5. With Mistral as the LLM at ~60 GB on disk, the same image pipeline would not co-reside; the operator would be in the “switch.sh none, run image batches, switch.sh qwen again” mode that costs minutes per image-generation pass. The mutex pattern is what makes the co-resident image pipeline work in practice.
Section 8: Voice and TTS
No TTS in production right now. Voxtral parked after the V6 ceiling, next-engine spike in progress.
- Voxtral-4B (text-only fork) was the working pipeline through 2026-04. The V6 spot-listen showed a long-form expressivity ceiling and the open-checkpoint encoder is gated (no voice cloning available). See Voxtral Capped at 3/10: Picking the Next Open TTS for the receipts
- The next-model spike (VibeVoice / Higgs Audio v2 / IndexTTS-2) is currently being evaluated; no winner picked, no production TTS deployed
- An earlier May-11 plan recommended Kokoro plus F5-TTS as the fallback path; the pivot article above explicitly retracts that recommendation after applying the podcast-specific filter
TTS is the layer where the sovereignty axis matters most for the podcast pipeline. Cloud TTS APIs have improved dramatically; sovereign TTS is still catching up. The choice to keep TTS local is partly aesthetic (the voice is recognizable, not a generic cloud voice) and partly defensive (the cloud TTS providers have a history of removing voices from their catalog on short notice).
The next-engine spike (VibeVoice Day-1 complete, Higgs Audio v2 and IndexTTS-2 Day-2 and Day-3 pending) determines which engine inherits the production TTS slot. The Day-1 result was that VibeVoice tops out around 7/10 on the V5 cold-open test, which is structurally similar to where Voxtral plateaued. The engine pick is not final until Day-3 is complete.
Section 9: Storage and backup
Two redundant storage paths, one cold-storage path, one off-site path, plus a USB stick recovery procedure that the operator is actually trained on.
- NVMe SSD on the Spark for hot working state
- NAS box on the LAN for nightly snapshots; see Strategy: Backup and Disaster Recovery for the working pattern
- Encrypted off-site backup to a remote storage provider (rented dimension; the provider is named and the consequences are accepted)
- A bootable USB stick with the full recovery procedure documented on the wiki, refreshed quarterly; see Backing Up 119B Parameters Without Bankruptcy for the strategy
The “backing up 119B parameters” problem is the unobvious one. The model weights are several tens of gigabytes per copy, and naive backup strategies fail at scale. The working pattern is to back up the configuration, the prompts, the customer data, and the model identifiers, then re-download the model weights from upstream on restore. The weights are reproducible from a known identifier; the customer data is not.
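The "back up identifiers, re-download weights" pattern can be sketched as a small manifest. This is an illustration under assumptions: the field names, the manifest layout, and the function name are hypothetical, and the article does not say how the site records its model identifiers.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ModelRef:
    repo_id: str    # upstream identifier to re-download from
    revision: str   # pinned commit or tag, so the restore is reproducible
    engine: str


def build_manifest(models: list[ModelRef], data_paths: list[str]) -> str:
    """Serialize what to re-download (models) and what to copy (data_paths)."""
    return json.dumps(
        {"redownload": [asdict(m) for m in models], "copy": sorted(data_paths)},
        indent=2,
        sort_keys=True,
    )
```

The manifest is a few hundred bytes and belongs in the backup set; the weights it points to do not. Pinning a revision matters because an upstream repository can change under the same name.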
The USB-primary backup posture is a deliberate choice over auto-push patterns. The USB stick is manually rotated and lives in a fire-safe drawer. Floki-pull (the public-facing VPS pulling from the Spark on a schedule) is the secondary path for the static-site content. There is no Floki-push, and there is no auto-timer that would put a customer-data delta on the network without the operator’s explicit involvement.
Section 10: Observability and monitoring
Prometheus, Grafana, the sovgrid dashboard, healthcheck systemd timers running every five minutes, and a single Matrix alert path.
- Prometheus scraping the Spark, the mini-PC, and the Floki VPS
- Grafana for the operator dashboard
- The sovgrid dashboard at services-sovereign-dashboard for the customer-facing health surface; see Services: Sovereign Dashboard
- The vllm-qwen36-healthcheck.timer systemd unit, enabled and active since 2026-05-21 22:48 CEST, running every 5 minutes with a decision tree for healthy / inactive / GPU-blocker / 2-fail-debounce auto-restart
- Daily healthcheck cron on the Floki VPS, with Matrix push on issues
- Alerts via Matrix, single channel, hard-rate-limited
The alerting discipline is the dimension where most one-person stacks fail. Alert fatigue produces operators who ignore alerts; absent alerts produce operators who miss outages. The working pattern is one channel, one rule: never alert on something that is not actionable within thirty minutes. Everything else goes to a dashboard the operator checks once a day. (See the forthcoming companion self-hosted-observability-one-person-ai-stack for the operational discipline.)
The healthcheck install was a 4-day stale memory item discovered in the 2026-05-25 audit: the v1 of this article and several adjacent memory files claimed the units were “install pending” when in fact they had been live since 2026-05-21. The memory-pending-audit-quarterly cadence (see Section 13 below) is the operator discipline that catches this class of drift.
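The four-way decision tree named for the healthcheck timer (healthy, inactive, GPU-blocker, 2-fail debounce) could look roughly like the sketch below. The branch meanings are my reading of those labels, not the site's actual script; the function name, the inputs, and the action strings are assumptions.

```python
def decide(unit_active: bool, endpoint_ok: bool, gpu_busy_elsewhere: bool,
           consecutive_failures: int) -> tuple[str, int]:
    """Return (action, updated_failure_count) for one five-minute tick."""
    if not unit_active:
        # Deliberately stopped (for example via the mutex): nothing to heal.
        return "skip-inactive", 0
    if endpoint_ok:
        return "healthy", 0
    if gpu_busy_elsewhere:
        # Another process holds the GPU; a restart would not help.
        return "skip-gpu-blocker", consecutive_failures
    failures = consecutive_failures + 1
    if failures >= 2:
        # Second failure in a row: restart and reset the counter.
        return "restart", 0
    return "wait", failures
```

The debounce means a single slow response never triggers a restart; with a five-minute timer, an unhealthy engine is restarted on the second consecutive failed tick.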
Section 11: Identity, publishing, and payment
Nostr for identity, Astro 6 + Caddy for publishing, Lightning + bank for payment, multi-agent AGENTS.md convention across all repositories.
- Nostr identity rooted in a secp256k1 key (the curve Nostr uses) on the local machine; multiple npubs (cipherfox, hexabella, sovgrid) for separated public surfaces. Post via the hardened /data/scripts/nostr/post.py only; nsecs never enter the agent context.
- Astro 6.3.7 static site, built locally, deployed via rsync to the Floki VPS, served by Caddy. Migration from Astro 5.18.1 to 6.3.7 was completed 2026-05-24 in a single sitting (commit a16ebd0 in the sovereign-blog repo); the loader pattern is glob({ pattern: '**/[^_]*.{md,mdx}', base: './src/content/blog' }) per the Astro 6 content-layer API
- Custom 5xx fallback page at floki/srv/500.html, served by Caddy’s handle_errors block when the blog backend or any upstream returns 500/502/503/504, so a backend outage does not show a generic Caddy error page to readers (BLOG-058)
- Alby Hub on the Spark for the Lightning node; see Setup: Alby Hub ARM64 Self-Hosted Lightning and the forthcoming companion operators-guide-self-hosted-lightning
- IBAN for invoice payments, bank account in operator’s name, no escrow or aggregator in the path
- Multi-agent AGENTS.md convention across 16 Gitea repositories: a standard frontmatter (type: multi-agent-contract), a Session-Ritual section, a binding-rules block (Verbindliche Regeln), anti-AI writing rules (anti-AI-Schreibregeln), concurrent-session discipline, tooling pointers, and a cross-repo dependency map. The convention is what allows multiple AI agents (Claude Code, opencode, and others) to coexist in the same codebase without stepping on each other.
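The effect of the loader pattern '**/[^_]*.{md,mdx}' in the list above is easiest to see with a few paths. The Python function below approximates the glob's intent (Markdown or MDX files whose file name does not start with an underscore); it is not Astro's matcher, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def is_loaded(relative_path: str) -> bool:
    """Approximate '**/[^_]*.{md,mdx}': .md or .mdx files whose name does not start with '_'."""
    path = PurePosixPath(relative_path)
    return path.suffix in {".md", ".mdx"} and not path.name.startswith("_")


for candidate in ["stack.md", "2026/stack.mdx", "_draft.md", "notes.txt"]:
    print(candidate, is_loaded(candidate))
```

The leading-underscore exclusion is what lets a draft sit next to published posts without being picked up by the build.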
The publishing layer is static-first because static sites are the most sovereign publishing surface available in 2026. There is no runtime dependency on a CMS, no database that can corrupt, no plugin marketplace that can break, and the archive is a flat directory of markdown files that can be read by any tool the operator chooses.
The payment layer is multi-channel because no single channel covers all customers. Lightning for the sovereign-native readers, bank transfer for the enterprise customers, and a fallback to a hardware wallet receive address for the contingency case.
The multi-agent convention is the operational dimension that gets the most quizzical looks from buyers and the most appreciative nods from other operators. The AGENTS.md per repo is the contract that says “this is how an agent should behave in this codebase,” and it includes the rules that prevent the most common multi-agent failure modes (broad commits picking up another agent’s uncommitted work, em-dash overuse in generated content, fact-fabrication on personal-experience numbers).
Section 12: Agent integration via MCP
The MCP server at sovgrid.org/self-hosted-ai is the canonical integration point for agents that want to talk to this stack.
- FastMCP 1.27.0-based server with four tools (search_blog, list_tags, get_article, diagnose_sglang)
- Published to the official MCP registry as org.sovgrid/self-hosted-ai, DNS-authenticated via ed25519 keys (no central authority required), live since 2026-05-05
- 100/100 score on Smithery, connector and server on Glama, awesome-mcp PR #5645 merged
- WebSite schema with SearchAction in BaseLayout for Google sitelinks-search-box discoverability (BLOG-057, shipped 2026-05-25)
For the reasoning and the build log, see Setup: Sovereign MCP Setup and Setup: MCP Listing Smithery 100. For the pattern catalog, see the forthcoming companion 5-mcp-patterns-beyond-search-the-database.
The MCP server is the integration surface that I expect to become the most-used customer-facing endpoint of this stack over the next year. Agents that want to ask the sovgrid corpus about specific topics can do so via the registered MCP. The protocol is open, the implementation is documented, and new tools are added according to a published roadmap.
Section 13: Operator discipline
The operator-side disciplines that make the stack survive contact with multiple AI agents and the passage of time.
- Multi-agent contract (AGENTS.md per repo, 16 repos consolidated 2026-05-23): standard frontmatter, session ritual, binding rules (Verbindliche Regeln), anti-AI writing rules (anti-AI-Schreibregeln), concurrent-session discipline.
- Memory-pending-audit-quarterly cadence: every quarter, grep through agent memory for “wartet auf X” / “blockiert” / “pending” claims and verify each one against current reality. Established 2026-05-25 after a single session uncovered five stale blockers including a two-day-stale Gitea-token rotation that was actually a five-second docker exec command. Next audits: 2026-08-25, 2026-11-25, 2027-02-25, 2027-05-25.
- Pre-commit bulk-block hook: rejects commits touching more than N files unless SOVEREIGN_BULK_OK=1 is explicitly set, which catches the multi-agent failure mode where one agent’s broad git add . picks up another agent’s in-flight work.
- Authorship-trailer hooks: every commit carries an explicit agent identifier in the trailer, so a multi-agent audit of the git log is trivially possible after the fact.
- Quarterly content audit: every quarter, walk the article corpus for stale claims, broken cross-links, and outdated benchmarks. Last audit: 2026-05-03. Next audit: 2026-08-01.
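The bulk-block hook from the list above can be sketched in Python. This is a hypothetical reconstruction, not the hook that runs on the site: the function names are mine, and the threshold is left as a required parameter because the article gives it only as N. The override variable SOVEREIGN_BULK_OK is the one the article names.

```python
import os
import subprocess
import sys


def staged_files() -> list[str]:
    """Paths staged for the next commit, as reported by git."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def should_block(n_files: int, limit: int, env=os.environ) -> bool:
    """Block when more than `limit` files are staged and no override is set."""
    return n_files > limit and env.get("SOVEREIGN_BULK_OK") != "1"


def main(limit: int) -> int:
    """Exit status for a pre-commit hook: 1 rejects the commit, 0 allows it."""
    n = len(staged_files())
    if should_block(n, limit):
        print(f"{n} files staged (limit {limit}); set SOVEREIGN_BULK_OK=1 to override",
              file=sys.stderr)
        return 1
    return 0
```

A pre-commit script would end with `sys.exit(main(limit=...))`, using whatever N the repository has settled on. Keeping `should_block` free of side effects makes the policy testable without a git repository.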
The operator-discipline layer is what most reference architectures skip and what most real stacks live or die by. The components above (multi-agent contract, memory audit, commit hooks, authorship trailers) are the operator-side equivalent of the inference-layer infrastructure described in Section 4. Neither layer is optional; both are load-bearing.
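The memory audit amounts to a grep followed by manual verification. A minimal sketch of the grep half is below, assuming the agent memory is a directory tree of Markdown files; that layout, the file extension, and the function name are assumptions. The marker phrases are the ones quoted in the audit description.

```python
import re
from pathlib import Path

# Marker phrases quoted in the audit description (German for "waiting on X"
# and "blocked") plus the English "pending".
PENDING = re.compile(r"wartet auf|blockiert|pending", re.IGNORECASE)


def find_pending_claims(root: Path, pattern: str = "*.md") -> list[tuple[str, int, str]]:
    """Return (file, line number, text) for every line that carries a marker."""
    hits = []
    for path in sorted(root.rglob(pattern)):
        for number, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if PENDING.search(line):
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits
```

The output is a worklist, not a verdict: each hit still has to be checked against the live system, which is the step that caught the healthcheck units listed as pending while they were already running.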
For the explicit version of the operator-discipline commitments themselves, see The Engineering Honesty Manifesto: six rules I hold this site to, each with a receipt from the operating log. For the operational receipts the discipline produces, see Five DGX Spark Disasters I Survived and Power Failure Recovery on a DGX Spark: 30-Minute Procedure. For the broader framing of what “sovereign” actually requires, see What Sovereign Actually Means in 2026.
Stack comparison: this stack vs cloud-API vs other self-hosted
| Dimension | This stack (sovgrid) | Cloud-API equivalent | Other self-hosted (dual 3090) |
|---|---|---|---|
| Hardware capital | ~€4,800 + €1,400 ancillary | €0 | ~€2,100 + €1,000 ancillary |
| Per-month operating cost | ~€800 | scales with usage (Opus 4.7 at $5/$25 per million input/output tokens) | ~€600 |
| Heavy-tier LLM model | Qwen 3.6 PrismaQuant primary, Mistral Small 4 fallback | Claude Opus 4.7, GPT-5 heavy | smaller dense models |
| Privacy / sovereignty | 6/6 dimensions owned (post-Cloudflared-retirement) | 0/6 dimensions owned | 5/6 dimensions owned |
| Setup time | 80 hours | 30 minutes | 40 hours |
| Best for | sovereign-AI consulting, MoE workloads | optionality, intermittent use, mini-tier (Haiku 4.5 / GPT-5 mini) | dense LLM, diffusion, lab learning |
| Worst at | dense >70B, 754B-class | privacy, lock-in, tokenizer changes | MoE 100B+ |
The table is a compression. For the long form of the cost analysis, Self-Hosted AI vs Cloud APIs: Real Total Cost walks the model row by row. For the alternative hardware comparison, DGX Spark vs M3 Ultra Mac Studio: Local LLM walks the architectures. For the model-stack comparison, Mistral Small 4 vs Qwen 3.6 vs GLM-5: DGX Spark walks Qwen versus Mistral versus the 754B-class. For the tooling comparison, Vibe vs OpenClaw vs Aider vs Claude Code 2026 walks the coding-assistant choices.
Three paths from here
Read more (blog). The cross-links above are the load-bearing entries. Self-Hosted AI Start Here is the canonical onboarding for a reader who has just landed on the site. Two Leaderboards Nobody Reads Together is the honest argument about why benchmark numbers in vendor marketing are not what they appear to be. The Engineering Honesty Manifesto is the lens under which every other article on the site is written; pair it with What Sovereign Actually Means in 2026 for the framework that decides which dimensions of sovereignty actually matter for your use case.
Connect your agent (MCP). The MCP server at sovgrid.org/self-hosted-ai accepts agents from any OpenAI-compatible or MCP-native client. Add the server URL to your client’s MCP configuration, point a search query at it, and the agent will be able to retrieve articles from the corpus in real time. For the integration guide and the four tools the server exposes, see Setup: Sovereign MCP Setup and the six-week build log at MCP for Engineers Who Hate Marketing. For the agent-side architecture and why each agent should have its own wallet, see Why Your Agent Should Have Its Own Wallet (L402).
Work with cipherfox (Stack Audit). If your team is scoping a sovereign-AI deployment and you want a second pair of eyes from someone who has shipped this stack into production, that is the use case for a Stack Audit. The audit is paid, two hours, fixed-fee, and ends with a documented recommendation: own this stack, deploy a reduced variant, stay on cloud-API and revisit in a year, or take the hybrid path. The honest answer is the answer the math says, not the answer that drives upsell.
To book: reach me through any of the contact links in the footer of this page (Nostr DM is the fastest, the email link is HTML-entity-encoded so it survives spam scrapers, the GitHub profile takes issues too). Include the workload sketch in the first message: calls per day, model tier, privacy axis. The dedicated booking page is in active build.
The stack is real, it ships, it pays for itself, and it does so without a single inference call leaving the operator’s premises. That is the architectural fact that the rest of the marketing has been trying to imitate, and it is the architectural fact that the sovereign-AI consulting practice can actually defend in front of a customer’s CISO. The reference architecture above is the receipt.
What changed in v2 (2026-05-25 refresh)
For readers who saw the v1 published 2026-05-20:
- Qwen 3.6 throughput updated from 45 tok/s baseline to 57 to 62 tok/s with DFlash speculative decoding. The model-pick rationale is in Strategy: Next Model Choices on DGX Spark and the head-to-head against Mistral and GLM-5 is in Mistral Small 4 vs Qwen 3.6 vs GLM-5 on DGX Spark.
- Mistral configuration reframed from “creative-writing primary at 35 tok/s with EAGLE” to “safer-eagle fallback at 36.5 tok/s, confirmed stable in the 2026-05-22 cleanup session via switch.sh.” Speculative-decoding behaviour and its content-dependence are documented at Fixes: EAGLE Content-Dependent Throughput.
- Section 3 reflects the Cloudflared retirement: direct Caddy + Let’s Encrypt on Floki VPS, no tunnel. The reliability-pattern receipts are at Caddy and Cloudflare Tunnel: The Reliability Pattern.
- Section 4 documents the switch.sh mutex pattern, replacing the master.py dispatcher narrative. The unit-file patterns the mutex sits on top of are in Systemd Patterns for Self-Hosted AI Services.
- Section 11 updated for the Astro 5 to 6 migration completed 2026-05-24, plus the 5xx fallback page (BLOG-058) and the multi-agent AGENTS.md convention across 16 repos. The publishing-stack receipts are at Astro 6 + Caddy: The Static-First AI Blog Stack.
- Section 13 is new: operator discipline (multi-agent contract, memory audit cadence, commit hooks, authorship trailers) is now first-class in the reference architecture. The explicit version of the discipline is The Engineering Honesty Manifesto.
- Pricing updated for the February 2026 Spark MSRP hike from $3,999 to $4,699. The cost-comparison breakdown is at Self-Hosted AI vs Cloud APIs: Real Total Cost.
- Hire CTA replaced with the footer-contact-strip pattern (Nostr / encoded-email / GitHub), since the dedicated /hire/ page is still in build.
The next planned refresh is 2026-08-25, synced with the memory-pending-audit-quarterly cadence.