The complete stack that runs sovgrid.org and its consulting practice, component by component, with the reasoning for each pick and the alternatives I considered. Hub article. Updated 2026-05-25 after the Qwen primary migration, the Cloudflared retirement, the Astro 5 to 6 upgrade, and the switch.sh mutex pattern.

The Sovereign AI Stack in 2026: A Reference Architecture

This is the stack that runs sovgrid.org and its consulting practice, end to end. It is honest about which components are owned, which are rented, and what would change for someone in a different situation. If you are scoping a sovereign-AI project for your own team, this is the article that gives you the bill of materials and the decision tree behind each line item.

The stack is described in thirteen sections, organized by layer. Each section ends with the alternative I considered, the alternative I would recommend for a different buyer profile, and a cross-link to the engineering postmortem where the decision was load-tested. The closing part, Three paths from here, covers how to read the rest of the blog, how to connect your agent via MCP, and how to engage me directly.

This is the v2 refresh dated 2026-05-25. The v1 was written 2026-05-20 against the older state of the stack. The load-bearing changes documented in this revision are the May 2026 model-stack migration to Qwen 3.6 PrismaQuant as the primary, the retirement of the Cloudflared tunnel in favor of direct Caddy + Let’s Encrypt on FlokiNET, the Astro 5 to 6 upgrade, and the switch.sh mutex pattern.

Quick Take

  • Stack shape: one DGX Spark, two production LLMs (Qwen 3.6 PrismaQuant primary at 57 to 62 tok/s with DFlash, Mistral Small 4 NVFP4 as the safer-eagle fallback at 36.5 tok/s for vision and German prose), one switch.sh mutex enforcing memory exclusivity, deployed once and refactored fifteen times.
  • The stack is sovereign on six of six dimensions (custody, control plane, supply chain, identity, revenue path, network ingress) as of the May 2026 Cloudflared retirement. The trade-off of accepting more operational responsibility for DDoS hardening is named and accepted.
  • Build cost (2026): approximately €4,800 for the Spark (post-February-2026 supply-chain hike) and €1,400 for the surrounding ancillary equipment. Software is overwhelmingly open-source and self-hosted.
  • Operating posture: static-first publishing on Astro 6, headless inference, mesh networking via Tailscale, observability via Prometheus and the dashboard at services-sovereign-dashboard, payments via Lightning + bank transfer.
  • Comparison anchor: the same workload on a cloud-API stack would cost an order of magnitude more per call at the volumes I operate, while removing the customer-facing sovereignty story that is the actual product. (For the cost-model breakdown, Self-Hosted AI vs Cloud APIs: Real Total Cost walks the numbers.)

Section 1: Hardware base

The DGX Spark is the foundation. One unit, single-box, in the office.

The Spark is the right hardware for this stack because the workload is mixture-of-experts language models in the 35B-total / 3B-active range (Qwen 3.6) plus the 119B dense Mistral Small 4 as a kept-in-reserve fallback. Hardware specifications are documented at the NVIDIA DGX Spark product page and the DGX Spark Hardware Overview (GB10 Grace Blackwell superchip, 128 GB LPDDR5X unified memory, 20 Arm cores). For the pre-purchase decision tree itself, see Should You Buy a DGX Spark in 2026, the literal scoping article. For the reasoning behind the hardware pick versus the four real alternatives (Mac Studio M4 Max at 128 GB unified, Mac Studio M3 Ultra at 96 to 512 GB unified, dual RTX 3090 build, Strix Halo mini-PC), see DGX Spark vs M3 Ultra Mac Studio: Local LLM for the long-form comparison. For the war-stories on the same hardware, see Five DGX Spark Disasters I Survived.

The mini-PC is the dimension most operators skip. It runs Tailscale, Prometheus, the alerting stack, the backup orchestrator, and the watchdog scripts. Its job is to remain online when the Spark is restarting, to record what the Spark did before the crash, and to serve as the operator’s gateway to the system. Cost: €350 used. Value: very high.

The October 2026 cliff: Apple’s M5 Ultra Mac Studio is expected to ship in late 2026 (delayed by global memory chip shortages). The M3 Ultra remains the current top Apple SKU until then. The practical advice for buyers in May 2026 is binary: either commit now or wait the four-to-six months. The Spark is not on the same refresh cadence; the next-generation Blackwell-class workstation has no public roadmap as of this writing.

Alternative for a different buyer: if you do not need MoE-class language models, the dual RTX 3090 build at roughly €2,100 is the better value for dense LLM plus diffusion plus general lab work. If you need macOS ergonomics or the 512 GB unified memory ceiling, the Mac Studio M3 Ultra is the right answer. For a budget-tier breakdown across price points, see the four-article series: 2k beginner, 4k mid-tier, 8k premium, and 15k pro-studio.

Section 2: Operating system and management plane

Ubuntu LTS on the Spark, Debian on the mini-PC, no graphical desktop running on either by default.

The headless decision is operational, not aesthetic. The desktop session on the Spark is fragile when the inference backend hits an edge case. (See Fixes: vLLM MoE Throughput sm121 Desktop Freeze for the worked example: the default FlashInfer-MoE backend would freeze the desktop while inference continued, requiring an SSH reboot from another machine. The fix is VLLM_FLASHINFER_MOE_BACKEND=latency.) Running headless removes the failure mode entirely.

systemd is the service manager because every long-running component on this stack is wrapped as a unit file. The pattern is: one unit per logical service, restart policies tuned to the failure mode, after-dependencies declared explicitly, journal output piped to the dashboard. The vllm-qwen36.service unit exists but is deliberately not enabled at boot; mutual exclusion with Mistral is an operator job through switch.sh, not a systemd default, because picking the wrong default at boot would either lock the operator into Qwen for vision work that needs Mistral or have both services race for unified memory at startup.

The page-cache hijack pattern is the second operational receipt worth knowing on Spark specifically: after a vLLM or SGLang crash, the kernel page cache holds stale model weights, and a relaunch without echo 3 > /proc/sys/vm/drop_caches produces an OOM at roughly 95 GB usage. One shell command before every engine relaunch keeps this from biting in production.
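The pre-relaunch step described above can be sketched in Python as follows. This is a minimal illustration, not the script used on this stack: the function names are hypothetical, the `os.sync()` call is an addition that the shell one-liner does not include, and writing to `drop_caches` requires root.

```python
import os


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cached_gib(meminfo: dict[str, int]) -> float:
    """Page cache currently held by the kernel, in GiB."""
    return meminfo.get("Cached", 0) / (1024 * 1024)


def drop_caches(path: str = "/proc/sys/vm/drop_caches") -> None:
    """Same effect as `echo 3 > /proc/sys/vm/drop_caches` (needs root)."""
    os.sync()  # assumption: flush dirty pages first so they can be dropped
    with open(path, "w") as fh:
        fh.write("3\n")


def prelaunch(meminfo_path: str = "/proc/meminfo") -> float:
    """Drop caches unconditionally; return the GiB of cache held beforehand."""
    with open(meminfo_path) as fh:
        before = cached_gib(parse_meminfo(fh.read()))
    drop_caches()
    return before
```

Calling `prelaunch()` before every engine start keeps the step unconditional, as the paragraph recommends, and the returned figure can be logged to see how much stale cache the previous crash left behind.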

Section 3: Networking and ingress

Tailscale for the operator mesh, Caddy as the reverse proxy on both the local Spark and the public-facing Floki VPS, direct Let’s Encrypt certificates instead of a Cloudflare tunnel as of May 2026.

The Cloudflared retirement is the May 2026 change worth flagging in this section. The previous architecture used a Cloudflare Tunnel for inbound traffic, which absorbed DDoS-class abuse at the edge but introduced a rented dimension that conflicted with the broader sovereignty posture. The migration replaced the tunnel with direct Caddy + Let’s Encrypt on the Floki VPS (Romania-hosted, FlokiNET infrastructure), which restores end-to-end ownership of the network path at the cost of accepting more operational responsibility for DDoS hardening.

The trade-off is named honestly. A serious DDoS against sovgrid.org now requires either rate-limiting at Caddy, IP-blocklisting at the VPS firewall, or scaling out to a second VPS. The Cloudflare Tunnel handled this class of abuse transparently. The motivation for the retirement was that the threat model for a one-person engineering blog is not a state-actor DDoS; it is the occasional vuln-scanner that Caddy’s edge-block pattern (see the floki/Caddyfile in the repo) handles cleanly. The retirement is a sovereignty win, not a security win, and the framing matters.

Tailscale is still rented, on a pragmatic trade-off similar to the one that originally justified the tunnel. The mesh works out of the box, the key custody is acceptable for the threat model, and the operational overhead of running Headscale is real. I have rehearsed the migration path to Headscale for the case where Tailscale’s terms change in a way I do not accept, but I have not yet executed it. (See Caddy Cloudflare Tunnel Reliability Pattern for the historical version of this pattern.)

Section 4: Inference layer

vLLM serving Qwen 3.6 PrismaQuant 4.75bit as the production primary, SGLang serving Mistral Small 4 NVFP4 as the safer-eagle fallback at 36.5 tok/s for vision and creative-writing workloads, with the switch.sh mutex enforcing exclusivity.

The two-model decision is workload-driven. Qwen 3.6 is the right primary for code, agent tools, and structured-output workloads where 57 to 62 tok/s and 97 percent ToolCall-15 accuracy matter. Mistral is the right secondary for vision-reading and creative-writing tasks where the NVFP4 quant preserves the Pixtral-lineage vision tower (which the PrismaQuant 4.75bit Qwen quant drops) and the prose quality on German has not yet been beaten by an open competitor.

The mutex pattern inverts the conventional “one model serves all calls” in favor of “two models on disk, one hot, mutex enforced.” The reason is unified-memory contention: hot-loading both Qwen at 22 GB and Mistral at 60 GB simultaneously creates a memory cascade that pulls the desktop session down. The switch.sh script handles the systemctl start/stop pair, the Watchtower disable-label that prevents the auto-update loop, and a status check that confirms which model is currently hot.
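The core of the mutex can be sketched as follows. This is not the contents of switch.sh, only an illustration of the stop-then-start ordering it enforces. The Qwen unit name comes from this article; the Mistral unit name and all function names are hypothetical, and the Watchtower label handling is left out.

```python
import subprocess
from typing import Callable

UNITS = {
    "qwen": "vllm-qwen36.service",         # unit name taken from the article
    "mistral": "sglang-mistral4.service",  # hypothetical unit name
}


def plan_switch(target: str) -> list[list[str]]:
    """Commands that leave only `target` running; "none" stops everything."""
    if target != "none" and target not in UNITS:
        raise ValueError(f"unknown target: {target!r}")
    # Stop every other model first so unified memory is released before the start.
    stops = [["systemctl", "stop", unit]
             for name, unit in UNITS.items() if name != target]
    starts = [] if target == "none" else [["systemctl", "start", UNITS[target]]]
    return stops + starts


def systemd_is_active(unit: str) -> bool:
    """True if systemd reports the unit as active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def hot_models(is_active: Callable[[str], bool] = systemd_is_active) -> list[str]:
    """Status check: names of the models whose units are currently active."""
    return [name for name, unit in UNITS.items() if is_active(unit)]


def switch(target: str, run=subprocess.run) -> None:
    """Execute the plan in order, aborting on the first failing command."""
    for cmd in plan_switch(target):
        run(cmd, check=True)
```

Keeping `plan_switch` pure makes the ordering easy to inspect without touching systemd, and `hot_models()` returning more than one name is the condition the mutex exists to prevent.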

For a buyer with a different workload mix, the answer changes. A code-only practice can drop Mistral and run Qwen alone, freeing the unified-memory budget for a co-resident image-generation pipeline. A creative-writing practice can flip the assignment. A vision-heavy practice will keep Mistral as primary and Qwen as secondary.

Section 5: Quantization and precision

PrismaQuant 4.75bit for Qwen, NVFP4 for Mistral, with the architectural reasoning recorded explicitly.

The right quantization for a model is not a property of the model; it is a property of the (model, workload, hardware) triple. NVFP4 is the right choice for Mistral on the Spark because the vision tower survives quantization, which matters for image-reading workloads. PrismaQuant 4.75bit is the right choice for Qwen 3.6 because it produces the highest measured single-Spark throughput on the public Spark Arena leaderboard, at the cost of dropping the vision tower from the local quant. (See Mistral vs Qwen 3.6: The Zero That Was a Broken Ruler for the verified-the-hard-way version of this finding, with the HTTP 200 round-trip on a real screenshot as the load-bearing evidence.)

For the quantization mental model in general, see NVFP4 Quantization Explained. The short version: quantization is lossy compression for model weights, the loss is bounded if you know what you are doing, and the bound is workload-dependent. NVFP4 is one of three quantization formats with serious Spark support (along with INT4 and MXFP4); the right pick depends on what you need to preserve from the unquantized baseline.

The corollary for buyers: do not trust the parameter count as a capability indicator. A 754B-class model like GLM-5.1 at AWQ INT4 is 377 GB on disk, three times the Spark’s 128 GB unified memory budget. The right question is not “which model is largest” but “which model is largest and still fits the hardware envelope I have decided to operate.”
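The arithmetic behind that envelope check is simple enough to sketch. The helper below assumes nominal bits per weight and ignores quantization scales, KV cache, and activation memory, so it is a lower bound on the real footprint, not a measurement; the function names are illustrative.

```python
SPARK_UNIFIED_GB = 128  # unified memory on the DGX Spark


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight footprint in GB: parameters times bits, divided by 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         budget_gb: float = SPARK_UNIFIED_GB) -> bool:
    """True if the weights alone fit the memory budget (ignores KV cache)."""
    return weights_gb(params_billion, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    candidates = [
        ("754B-class at INT4", 754, 4.0),
        ("Qwen 3.6 35B at 4.75 bit", 35, 4.75),
        ("Mistral Small 4 119B at 4 bit", 119, 4.0),
    ]
    for name, params, bits in candidates:
        size = weights_gb(params, bits)
        print(f"{name}: {size:.1f} GB, fits={fits(params, bits)}")
```

The 754B case gives 377 GB, matching the on-disk figure above, while the two production models come out at roughly 20.8 GB and 59.5 GB, consistent with the 22 GB and 60 GB hot-loaded sizes quoted in Section 4.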

Section 6: Speculative decoding

DFlash on Qwen for 57 to 62 tok/s, EAGLE on Mistral at 36.5 tok/s now that the current SGLang nightly has been confirmed stable, MTP n=3 stable on Qwen.

Speculative decoding sounds like free throughput. It is not free, and on some workloads it is net-negative. EAGLE’s draft distribution is structured-output-hostile; on JSON-emitting workloads, EAGLE drops throughput rather than raising it. (See Fixes: EAGLE Content-Dependent Throughput for the worked failure mode.) For the long version of “when does this technique help and when does it hurt,” the forthcoming companion eagle-speculative-decoding-when-helps-when-doesnt walks the decision tree.

The May 2026 state on Mistral: EAGLE on the current SGLang nightly was confirmed stable in the 2026-05-22 switch.sh cleanup session, measuring 36.5 tok/s versus the no-EAGLE baseline of 29 tok/s. Mistral is therefore running the safer-eagle configuration in production. The rollback path to sglang-mistral4-safer.sh (no-EAGLE) is available if the regression resurfaces and is tracked as a documented Gitea contingency.

DFlash on Qwen is stable and is the configuration that produces the 57 to 62 tok/s number. MTP n=3 is stable; n=4 regresses. The dispatcher in front of the inference backends knows the workload class and can disable speculation per request when the workload calls for it. This is the kind of detail that does not appear in vendor documentation and only appears in an operational runbook after the operator has been burned once.
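The per-request decision can be sketched as a lookup keyed on the speculation technique and the workload class. The class labels, the `Request` shape, and the function names are hypothetical; how the flag is actually passed to the inference backend is not shown because it depends on the engine.

```python
from dataclasses import dataclass

# (technique, workload class) pairs where speculation lowered throughput.
# Only the EAGLE + structured-output case is taken from the text above.
HOSTILE = {("eagle", "structured_output")}


@dataclass(frozen=True)
class Request:
    workload: str    # e.g. "prose", "code", "structured_output"
    speculator: str  # technique configured on the hot model, e.g. "eagle"


def use_speculation(req: Request) -> bool:
    """Keep speculation on unless this pairing is known to be net-negative."""
    return (req.speculator, req.workload) not in HOSTILE


def speedup(with_spec_tok_s: float, baseline_tok_s: float) -> float:
    """Throughput ratio of the speculative configuration over the baseline."""
    return with_spec_tok_s / baseline_tok_s
```

With the Mistral numbers from this section, `speedup(36.5, 29)` is about 1.26, which is the gain a JSON-heavy request would forfeit nothing by skipping if EAGLE is net-negative on that class.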

Section 7: Image generation

FLUX.1-schnell on ComfyUI is the current production image-gen pipeline. The blog hero images for every article on this site come out of it; the speed (sub-second per image on the Spark, ~5 minutes for a 32-article hero batch) matches the publishing cadence.

The Qwen-Image-2512 model (a competing open-weights text-to-image model with materially better text-rendering quality, ELO 1161 on the artificialanalysis.ai leaderboard versus FLUX.1-schnell’s lower position) is a candidate for evaluation when the workload needs text-in-image generation: episode-cover art with titles, infographics, or labeled diagrams. For the current workload of photorealistic abstract motifs without text overlay, FLUX.1-schnell is the right tool on both speed and quality. I have not yet downloaded the model, benchmarked it, or switched the production rotation.

The “co-resident” decision is the operational payoff of the PrismaQuant quantization choice from Section 5. With Mistral as the LLM at ~60 GB on disk, the same image pipeline would not co-reside; the operator would be in the “switch.sh none, run image batches, switch.sh qwen again” mode that costs minutes per image-generation pass. The mutex pattern is what makes the co-resident image pipeline work in practice.

Section 8: Voice and TTS

No TTS in production right now. Voxtral parked after the V6 ceiling, next-engine spike in progress.

TTS is the layer where the sovereignty axis matters most for the podcast pipeline. Cloud TTS APIs have improved dramatically; sovereign TTS is still catching up. The choice to keep TTS local is partly aesthetic (the voice is recognizable, not a generic cloud voice) and partly defensive (the cloud TTS providers have a history of removing voices from their catalog on short notice).

The next-engine spike (VibeVoice Day-1 complete, Higgs Audio v2 and IndexTTS-2 Day-2 and Day-3 pending) determines which engine inherits the production TTS slot. The Day-1 result was that VibeVoice tops out at around 7/10 on the V5 cold-open test, which is structurally similar to where Voxtral plateaued. The engine pick is not final until the Day-3 results are in.

Section 9: Storage and backup

Two redundant storage paths, one cold-storage path, one off-site path, plus a USB stick recovery procedure that the operator is actually trained on.

The “backing up 119B parameters” problem is the unobvious one. The model weights are several tens of gigabytes per copy, and naive backup strategies fail at scale. The working pattern is to back up the configuration, the prompts, the customer data, and the model identifiers, then re-download the model weights from upstream on restore. The weights are reproducible from a known identifier; the customer data is not.
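A manifest-based version of that pattern might look like the sketch below. The field names, the example identifiers, and the textual restore plan are all hypothetical; the point is only that the backup carries pinned identifiers and the irreplaceable paths, never the weights themselves.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelRef:
    name: str      # local label for the model
    source: str    # upstream repository identifier (hypothetical values below)
    revision: str  # pinned revision so a restore fetches the same bytes


def build_manifest(models: list[ModelRef], backup_paths: list[str]) -> str:
    """Serialize what goes on the backup medium: identifiers plus data paths."""
    payload = {
        "models": [asdict(m) for m in models],
        "backup_paths": sorted(backup_paths),
    }
    return json.dumps(payload, indent=2, sort_keys=True)


def restore_plan(manifest_json: str) -> list[str]:
    """Human-readable steps: restore data from backup, re-fetch weights upstream."""
    data = json.loads(manifest_json)
    steps = [f"restore {path} from backup" for path in data["backup_paths"]]
    steps += [f"re-download {m['source']} at {m['revision']}" for m in data["models"]]
    return steps
```

A manifest like `build_manifest([ModelRef("primary", "example-org/example-model", "abc123")], ["config/", "prompts/", "customers/"])` is a few hundred bytes, which is what makes the USB-stick rotation practical.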

The USB-primary backup posture is a deliberate choice over auto-push patterns. The USB stick is manually rotated and lives in a fire-safe drawer. Floki-pull (the public-facing VPS pulling from the Spark on a schedule) is the secondary path for the static-site content. There is no Floki-push, and there is no auto-timer that would put a customer-data delta on the network without the operator’s explicit involvement.

Section 10: Observability and monitoring

Prometheus, Grafana, the sovgrid dashboard, healthcheck systemd timers running every five minutes, and a single Matrix alert path.

The alerting discipline is the dimension where most one-person stacks fail. Alert fatigue produces operators who ignore alerts; absent alerts produce operators who miss outages. The working pattern is one channel, one rule: never alert on something that is not actionable within thirty minutes. Everything else goes to a dashboard the operator checks once a day. (See the forthcoming companion self-hosted-observability-one-person-ai-stack for the operational discipline.)
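The one-channel, one-rule discipline reduces to a small routing function. The sketch below reads "actionable within thirty minutes" as: the operator can do something about the event, and doing so within half an hour matters. That reading, the `Event` fields, and the channel labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta

ACTIONABLE_WINDOW = timedelta(minutes=30)


@dataclass(frozen=True)
class Event:
    name: str
    actionable: bool        # can the operator do anything about it at all?
    time_to_act: timedelta  # how soon a response is needed to matter


def route(event: Event) -> str:
    """Send to the single alert channel only if actionable inside the window."""
    if event.actionable and event.time_to_act <= ACTIONABLE_WINDOW:
        return "matrix"
    return "dashboard"
```

An inference service that has stopped responding would route to the alert channel, while a slowly filling disk with days of headroom goes to the dashboard that is checked once a day.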

The healthcheck install was a 4-day stale memory item discovered in the 2026-05-25 audit: the v1 of this article and several adjacent memory files claimed the units were “install pending” when in fact they had been live since 2026-05-21. The memory-pending-audit-quarterly cadence (see Section 13 below) is the operator discipline that catches this class of drift.

Section 11: Identity, publishing, and payment

Nostr for identity, Astro 6 + Caddy for publishing, Lightning + bank for payment, multi-agent AGENTS.md convention across all repositories.

The publishing layer is static-first because static sites are the most sovereign publishing surface available in 2026. There is no runtime dependency on a CMS, no database that can corrupt, no plugin marketplace that can break, and the archive is a flat directory of markdown files that can be read by any tool the operator chooses.

The payment layer is multi-channel because no single channel covers all customers. Lightning for the sovereign-native readers, bank transfer for the enterprise customers, and a fallback to a hardware wallet receive address for the contingency case.

The multi-agent convention is the operational dimension that gets the most quizzical looks from buyers and the most appreciative nods from other operators. The AGENTS.md per repo is the contract that says “this is how an agent should behave in this codebase,” and it includes the rules that prevent the most common multi-agent failure modes (broad commits picking up another agent’s uncommitted work, em-dash overuse in generated content, fact-fabrication on personal-experience numbers).

Section 12: Agent integration via MCP

The MCP server at sovgrid.org/self-hosted-ai is the canonical integration point for agents that want to talk to this stack.

For the reasoning and the build log, see Setup: Sovereign MCP Setup and Setup: MCP Listing Smithery 100. For the pattern catalog, see the forthcoming companion 5-mcp-patterns-beyond-search-the-database.

The MCP server is the integration surface that I expect to become the most-used customer-facing endpoint of this stack over the next year. Agents that want to ask the sovgrid corpus about specific topics can do so via the registered MCP. The protocol is open, the implementation is documented, and new tools are added according to a published roadmap.

Section 13: Operator discipline

The operator-side disciplines that make the stack survive contact with multiple AI agents and the passage of time.

The operator-discipline layer is what most reference architectures skip and what most real stacks live or die by. The components above (multi-agent contract, memory audit, commit hooks, authorship trailers) are the operator-side equivalent of the inference-layer infrastructure described in Section 4. Neither layer is optional; both are load-bearing.

For the explicit version of the operator-discipline commitments themselves, see The Engineering Honesty Manifesto: six rules I hold this site to, each with a receipt from the operating log. For the operational receipts the discipline produces, see Five DGX Spark Disasters I Survived and Power Failure Recovery on a DGX Spark: 30-Minute Procedure. For the broader framing of what “sovereign” actually requires, see What Sovereign Actually Means in 2026.

Stack comparison: this stack vs cloud-API vs other self-hosted

| Dimension | This stack (sovgrid) | Cloud-API equivalent | Other self-hosted (dual 3090) |
|---|---|---|---|
| Hardware capital | ~€4,800 + €1,400 ancillary | €0 | ~€2,100 + €1,000 ancillary |
| Per-month operating cost | ~€800 | scales with usage (Opus 4.7 at $5/$25 per million tokens) | ~€600 |
| Heavy-tier LLM model | Qwen 3.6 PrismaQuant primary, Mistral Small 4 fallback | Claude Opus 4.7, GPT-5 heavy | smaller dense models |
| Privacy / sovereignty | 6/6 dimensions owned (post-Cloudflared-retirement) | 0/6 dimensions owned | 5/6 dimensions owned |
| Setup time | 80 hours | 30 minutes | 40 hours |
| Best for | sovereign-AI consulting, MoE workloads | optionality, intermittent use, mini-tier (Haiku 4.5 / GPT-5 mini) | dense LLM, diffusion, lab learning |
| Worst at | dense >70B, 754B-class | privacy, lock-in, tokenizer changes | MoE 100B+ |

The table is a compression. For the long form of the cost analysis, Self-Hosted AI vs Cloud APIs: Real Total Cost walks the model row by row. For the alternative hardware comparison, DGX Spark vs M3 Ultra Mac Studio: Local LLM walks the architectures. For the model-stack comparison, Mistral Small 4 vs Qwen 3.6 vs GLM-5: DGX Spark walks Qwen versus Mistral versus the 754B-class. For the tooling comparison, Vibe vs OpenClaw vs Aider vs Claude Code 2026 walks the coding-assistant choices.

Three paths from here

Read more (blog). The cross-links above are the load-bearing entries. Self-Hosted AI Start Here is the canonical onboarding for a reader who has just landed on the site. Two Leaderboards Nobody Reads Together is the honest argument about why benchmark numbers in vendor marketing are not what they appear to be. The Engineering Honesty Manifesto is the lens under which every other article on the site is written; pair it with What Sovereign Actually Means in 2026 for the framework that decides which dimensions of sovereignty actually matter for your use case.

Connect your agent (MCP). The MCP server at sovgrid.org/self-hosted-ai accepts agents from any OpenAI-compatible or MCP-native client. Add the server URL to your client’s MCP configuration, point a search query at it, and the agent will be able to retrieve articles from the corpus in real time. For the integration guide and the four tools the server exposes, see Setup: Sovereign MCP Setup and the six-week build log at MCP for Engineers Who Hate Marketing. For the agent-side architecture and why each agent should have its own wallet, see Why Your Agent Should Have Its Own Wallet (L402).

Work with cipherfox (Stack Audit). If your team is scoping a sovereign-AI deployment and you want a second pair of eyes from someone who has shipped this stack into production, that is the use case for a Stack Audit. The audit is paid, two hours, fixed-fee, and ends with a documented recommendation: own this stack, deploy a reduced variant, stay on cloud-API and revisit in a year, or take the hybrid path. The honest answer is the answer the math says, not the answer that drives upsell.

To book: reach me through any of the contact links in the footer of this page (Nostr DM is the fastest, the email link is HTML-entity-encoded so it survives spam scrapers, the GitHub profile takes issues too). Include the workload sketch in the first message: calls per day, model tier, privacy axis. The dedicated booking page is in active build.

The stack is real, it ships, it pays for itself, and it does so without a single inference call leaving the operator’s premises. That is the architectural fact that the rest of the marketing has been trying to imitate, and it is the architectural fact that the sovereign-AI consulting practice can actually defend in front of a customer’s CISO. The reference architecture above is the receipt.

What changed in v2 (2026-05-25 refresh)

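For readers who saw the v1 published 2026-05-20, the load-bearing changes in this revision are:

  • The model-stack migration to Qwen 3.6 PrismaQuant as the production primary.
  • The retirement of the Cloudflared tunnel in favor of direct Caddy + Let’s Encrypt on FlokiNET.
  • The Astro 5 to 6 upgrade.
  • The switch.sh mutex pattern.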

The next planned refresh is 2026-08-25, synced with the memory-pending-audit-quarterly cadence.