HUB Sovereign AI Grid: What's Working and What Comes Next
I’ve been an early adopter most of my life. Nostr when it had almost no users. AI tools before they had reliable UIs. The pattern is always the same: spot something disruptive early, build deep familiarity before it gets packaged into products that hide the internals. Whether this specific stack turns into something sustainable I genuinely don’t know. The complexity keeps growing faster than I can tame it. I’m doing this anyway.
The plan was a rented VPS. The reality is a local ARM64 machine with 128 GB of unified memory that the OS routinely lies about. This article is the operational map for everything running here: hardware, inference, the knowledge base, the parts that work, and the parts that still don’t.
If you’re reading this for the first time, the reading order is at the bottom.
State of the Stack (2026-04-27)
- Mistral Small 4 runs locally on DGX Spark: ARM64 GB10, 128 GB unified memory, no cloud inference
- SGLang OOMs are solved with a 60-second restart delay, documented in SGLang Restart OOM Fix
- Voxtral (TTS), ComfyUI (FLUX) and SGLang (Mistral) cannot share the GPU pool: one at a time, always
- The knowledge base (sovereign-kb) is what keeps Mistral honest: the actual core of the pipeline
- A local MCP server now exposes blog-search and SGLang-diagnose tools to OpenClaw and Vibe — agents stop re-asking what was already documented
- Lightning V4V earns less than €20/month: real, stable, not yet a business
- MCP Freemium layer is next: three tools, Lightning L402, targeting mid-May
The full pipeline, from raw source material through knowledge base injection to published article and podcast audio:
┌─────────────┐ ┌────────────────────────────────────────┐
│ Source │ │ Knowledge Base │
│ material │ │ sovereign-kb/ podcast-studio/kb/ │
│ articles │ │ overuse-phrases caps-allowlist │
│ notes │ │ prosody-markers back-channels │
│ topics │ │ voice-findings repair-templates │
└──────┬──────┘ └───────────────────────┬───────────────┘
│ │
└───────────────────────────────────┘
│
│ kb_service.py injects at prompt-build time
▼
┌──────────────────────────────────┐
│ Mistral Small 4 │ ← Vibe CLI
│ SGLang · Triton · DGX Spark │ OpenClaw
│ Part 1 → Part 2 → Naturalize │
└──────────────┬───────────────────┘
│
▼
┌────────────────────────┐
│ quality_gate.py │ regex · no LLM
└──────┬─────────────────┘
fail │ pass
─────────┼──────────────────────────
retry/abort│ │
│ ┌──────────────┴──────────┐
│ │ │
│ Voxtral TTS ComfyUI FLUX
│ podcast audio hero image
│ │ │
│ mix_audio.py │
│ └─────────────┬───────────┘
│ ▼
│ ┌────────────────────────┐
└───────────►│ Published │
│ article + MP3 │
└────────────────────────┘
Claude Code handles architecture and meta-work (cloud).
Mistral handles content execution at zero per-article cost.
OpenClaw bridges cloud and local in a single session,
with MCP tools for self-service blog and diagnostics.
The Hardware: Why the VPS Plan Died
The original plan was a rented VPS: 32 GB RAM, Mistral Small 4 via SGLang. That held until the NVIDIA DGX Spark arrived. The Spark runs an ARM64 GB10 chip with 128 GB unified memory shared between CPU and GPU. Inference latency dropped. Privacy improved. Nothing leaves the room for inference.
A VPS still handles blog hosting via nginx. When the IPv6 stack gets unstable, the blog goes briefly unreachable; inference is unaffected because it never runs through the VPS.
The full DGX Spark setup and its ARM64-specific gotchas are documented in Self-Host Mistral Small 4 with SGLang on NVIDIA DGX Spark. The performance benchmarks (tokens per second, EAGLE speculative decoding, Triton vs FlashInfer) are in SGLang on DGX Spark.
The Sequential-Services Rule
Three GPU-bound services compete for the same 128 GB unified pool: SGLang (Mistral, ~94 GB), Voxtral TTS (~111 GB while loaded), ComfyUI with FLUX.1-schnell (~14 GB). Combined they don’t fit. They alternate.
Article generation: SGLang runs continuously. Podcast audio: stop SGLang, wait 60 seconds, start Voxtral, generate, stop Voxtral. Hero images: stop SGLang, wait 60 seconds, start ComfyUI, generate, stop ComfyUI, restart SGLang. The 60-second wait is for unified memory to actually release after a docker kill — without it, the next service loads and OOM-kills on the first request.
A local dashboard with start/stop controls and a memory guard removed the manual coordination cost. The rule itself is simple: one inference service at a time. Detection in code is harder than it sounds because nothing crashes immediately when both are loaded. The OOM happens on the first real request, after everything looks healthy.
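A minimal sketch of that switch, assuming systemd units per service (the unit names and the sudo wiring are placeholders, not the exact production setup); the 60-second sleep is the part that matters:

# switch_service.py: sketch of the one-at-a-time rule; unit names are assumptions
import subprocess
import time

def switch(stop_unit: str, start_unit: str) -> None:
    subprocess.run(["sudo", "systemctl", "stop", stop_unit], check=True)
    time.sleep(60)   # unified memory on GB10 needs ~60 s to actually release
    subprocess.run(["sudo", "systemctl", "start", start_unit], check=True)
    # "started" is not "ready": the caller still has to poll the service's API

# e.g. switch("sglang-mistral4", "voxtral-tts") before a podcast audio run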
The Knowledge Base: The Actual Core
The pipeline doesn’t just run Mistral and hope. It runs Mistral against a curated knowledge base that gets loaded into every prompt at generation time. Without it, hallucination rates sit around 22%. With it, they’re around 12%. Not solved, but managed.
The knowledge base lives in two places:
/data/projects/sovereign-kb/: cross-project, versioned alongside this blog in Gitea. Contains:
- mistral-overuse-phrases.md: 13 phrases Mistral overuses across all model sizes (Large to Small). “Here’s the thing”, “absolutely”, “great question” and ten others. Each one is filtered at the quality gate before publish. Case-insensitive, substring match, breaks the pipeline if found.
- prosody-markers.md: rules for how punctuation changes Voxtral’s spoken output. An em-dash creates trailing-off, ?! is the strongest enthusiasm trigger, an ellipsis creates hesitation. These don’t do anything in written text; they only matter for TTS generation.
- forbidden-markup-voxtral.md: markup that Voxtral reads literally instead of interpreting. Asterisks, angle brackets, bracket tags. The TTS pipeline strips these before synthesis.
- dialog-techniques.md: patterns for natural back-and-forth dialogue in the podcast pipeline. Repair sequences, back-channels, topic introductions.
- learning-principles.md: cognitive science grounding for why the podcast format works the way it does. Mayer’s Cognitive Theory of Multimedia Learning, parasocial interaction. Referenced when making decisions about script pacing.
- voice-findings.md: empirical results from Voxtral expressivity tests. Which voice presets produce which characteristics, why speed > 1.0 sounds broken, why CAPS words get spelled out letter by letter.
/data/projects/podcast-studio/kb/: project-specific, loaded only for podcast generation:
- caps-allowlist.md: 60+ technical acronyms Voxtral should spell out correctly (LLM, API, GPU, CIPHERFOX, NVFP4…). Without this list, the quality gate generates false warnings on every episode.
- back-channels.md: reactive turn patterns for HEXABELLA and CIPHERFOX. “Right.”, “Wait — really?”, “Hm.” Short acknowledgments that prevent one speaker from dominating 3+ consecutive turns.
- repair-templates.md: when a host gets a fact wrong, how the other host corrects it naturally without breaking conversational flow.
- topic-intros.md: how to open a new topic without using the phrases in mistral-overuse-phrases.md.
The KB loads at prompt-build time via kb_service.py. Every entry has a frontmatter ID, type, scope, and severity. High-severity entries block publish if their constraints are violated. The KB is the single source of truth: if a rule isn’t in the KB, it isn’t enforced.
# kb_service.py — how entries get loaded
from pathlib import Path

def load_kb_entries(paths: list[Path]) -> list[dict]:
    entries = []
    for p in paths:
        meta, body = parse_frontmatter(p)  # splits YAML frontmatter from the markdown body
        if meta.get("scope") in ("cross-project", "podcast"):
            entries.append({"id": meta["id"], "type": meta["type"],
                            "severity": meta.get("severity", "low"), "body": body})
    return entries
The quality gate (quality_gate.py) is deterministic: regex and string matching, no LLM. It runs after every generation and after naturalization. High-severity failures abort the pipeline. Warnings pass through but get logged.
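A stripped-down sketch of one of those checks, reusing the entry shape from load_kb_entries above. The overuse-phrase type string, the one-phrase-per-line body format, and the abort logic are illustrative assumptions; the real gate covers more rule types:

# quality_gate.py: sketch of the overuse-phrase check only
import sys

def overuse_hits(text: str, entries: list[dict]) -> list[tuple[str, str]]:
    """Return (entry_id, severity) for every KB phrase found in the draft."""
    hits = []
    lowered = text.lower()
    for entry in entries:
        if entry["type"] != "overuse-phrase":       # type string is an assumption
            continue
        for line in entry["body"].splitlines():
            phrase = line.strip("- ").strip().lower()
            if phrase and phrase in lowered:        # case-insensitive substring match
                hits.append((entry["id"], entry["severity"]))
    return hits

def gate(text: str, entries: list[dict]) -> None:
    hits = overuse_hits(text, entries)
    for entry_id, severity in hits:
        print(f"[{severity}] {entry_id}", file=sys.stderr)
    if any(sev == "high" for _, sev in hits):
        sys.exit(1)   # high-severity failure: abort the pipeline, log the rest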
MCP: Giving the Agents Eyes
A recurring failure pattern: I’d hit an SGLang OOM, paste the error to whatever AI agent was open, and watch it suggest --attention-backend flashinfer or some other option that crashes on GB10. The agents had no way to know about hardware-specific quirks unless I re-explained them every session.
The fix is a small MCP server (sovereign-mcp, FastMCP, port 8002) with three tools:
- search_blog: TF-IDF over published articles
- get_article: fetch full text by slug
- diagnose_sglang: returns the seven GB10/SM121A rules
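In outline the server is small. A sketch of its shape, assuming the MCP Python SDK’s FastMCP class; the paths, the placeholder substring search, and the rules file are made up, and the port and transport arguments should be checked against the SDK version in use:

# sovereign_mcp.py: sketch only; production tool bodies differ
from pathlib import Path
from mcp.server.fastmcp import FastMCP

ARTICLE_DIR = Path("/data/projects/blog/articles")              # hypothetical location
RULES_FILE = Path("/data/projects/sovereign-kb/gb10-rules.md")   # hypothetical file

mcp = FastMCP("sovereign-mcp", port=8002)

@mcp.tool()
def search_blog(query: str) -> list[str]:
    """Return slugs of articles that mention the query (production uses TF-IDF)."""
    return [p.stem for p in ARTICLE_DIR.glob("*.md")
            if query.lower() in p.read_text().lower()]

@mcp.tool()
def get_article(slug: str) -> str:
    """Fetch the full text of one article by slug."""
    return (ARTICLE_DIR / f"{slug}.md").read_text()

@mcp.tool()
def diagnose_sglang() -> str:
    """Return the seven GB10/SM121A rules as plain text."""
    return RULES_FILE.read_text()

if __name__ == "__main__":
    mcp.run(transport="sse")   # HTTP-based transport so local agents can connect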
Both OpenClaw (the local-and-cloud agent) and Vibe (Mistral-only CLI) connect to it. When SGLang misbehaves, they check the diagnose tool. When asked to write something, they search existing articles first to avoid duplicates and to build on what’s there.
The MCP server runs as a systemd service, autostart, no GPU usage. Restart is exposed in the dashboard with a confirmation dialog and a NOPASSWD-sudoers entry scoped to that one command — no wildcards. Setup details in Sovereign MCP Server Setup.
The Podcast: HEXABELLA + CIPHERFOX
The pipeline generates more than blog articles. It also produces podcast episodes: two-host conversation format, synthetic dialogue, real technical content.
CIPHERFOX is me. HEXABELLA is an AI persona: a helpful, curious conversational partner who plays the “explain it to me” role. The dynamic works because HEXABELLA isn’t performing ignorance. She asks the questions a technically curious non-expert would actually ask, and the answers have to be real.
The current frame for the show: the story of an AI beginner setting out into a new world. Episodes cover sovereign AI, Nostr, Lightning, and the infrastructure behind decentralized communication. Not just how to set things up. Why they exist and why the timing matters.
The podcast runs on V4V. Available wherever there’s no KYC requirement (Apple Podcasts has a developer account requirement that conflicts with the model). The goal is to publish alongside blog articles: same technical material, different format, different audience entry point.
sovereign-kb/ and podcast-studio/kb/ both feed into episode generation. The dialogue is synthesized by Mistral Small 4, naturalized against the KB rules, and read by Voxtral for HEXABELLA’s voice. Everything runs on the Spark. Zero cloud dependency once the pipeline is running.
The Inference Stack
Mistral Small 4 via SGLang on the DGX Spark. Triton attention backend, not FlashInfer. FlashInfer’s initialization allocates 2 GB on top of the model and crashes on GB10’s SM 12.1 architecture. Triton gives 35–41 tokens per second with EAGLE speculative decoding on typical article workloads.
OpenHands handles longer autonomous coding tasks with enable_prompt_extensions=false. That flag is non-negotiable: without it, the alternating-roles constraint in SGLang/Mistral triggers BadRequestError on every multi-turn agent call. Full diagnosis in Fix: OpenHands BadRequestError.
Vibe CLI wraps SGLang for interactive single-session coding. OpenClaw is the newer agent: Matrix bridge, can swap between Anthropic models (Claude Sonnet 4.6, Opus 4.7) and the local Mistral mid-session, with MCP tools for self-service. For complex multi-step architecture work, Claude Code takes over. That’s a deliberate cost, documented in Six Weeks Running Mistral Small 4 and Why I Kept Claude Code + Vibe.
# SGLang launch — DGX Spark ARM64
python -m sglang.launch_server \
  --model-path /models/Mistral-Small-4 \
  --port 30000 \
  --attention-backend triton \
  --mem-fraction-static 0.75 \
  --max-running-requests 16
# --attention-backend flashinfer → instant crash on GB10 ARM64
What Breaks and the Actual Fixes
SGLang OOM after restart. On the GB10’s unified memory architecture, killing SGLang doesn’t immediately free the GPU pool; the OS marks memory as free but doesn’t release it. Restarting immediately causes an OOM that looks like a leak. Fix: wait 60 seconds. Full diagnosis and systemd config in SGLang Restart OOM Fix.
Voxtral, ComfyUI, and Mistral cannot coexist. Voxtral loads 111 GB of the 128 GB pool. SGLang needs the rest. ComfyUI adds 14 GB. They alternate: stop one → wait 60s → run the next → stop → restart the previous. Three services, one pool, no shortcuts.
Citation hallucinations at 12%. The quality scorer flags missing version strings, invented file paths, and citations that don’t appear in the KB. Prompt engineering got the rate down from 22%. The KB keeps it there. The MCP search_blog tool helps further by letting the agent verify before claiming. Manual review still catches the remainder. Not solved, managed.
OpenClaw streaming watchdog reset. When switching to Sonnet mid-session, the first stream sometimes drops at the 30-second watchdog. Workaround: send a new message, the next stream resyncs. Appears to be API-side, not OpenClaw-side.
VPS IPv6 drops. Affects blog uptime only, not inference. Dual-stack nginx config with IPv4 fallback keeps the blog reachable most of the time. A more stable VPS is on the migration list.
# /etc/systemd/system/sglang-mistral4.service
[Service]
# wait for unified memory to actually release before the next start
ExecStopPost=/bin/sleep 60
Restart=on-failure
RestartSec=65
Watch out:
- --attention-backend flashinfer on GB10: silent crash or OOM. Always use triton
- --mem-fraction-static above 0.85 on GB10: OOM at startup
- enable_prompt_extensions=false for OpenHands: not optional
- After killing SGLang: 60 seconds. Not 10, not 30
- Voxtral startup: 60–90 seconds for model load; check /v1/models before starting TTS (a readiness-poll sketch follows this list)
- Sovereign MCP on port 8002, not 8001: collides with Voxtral if you swap them
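The readiness check is trivial but saves a failed first synthesis. A sketch, assuming Voxtral serves an OpenAI-compatible API on port 8001 (the port and the 120-second budget are assumptions):

# wait_for_voxtral.py: poll /v1/models until the model has actually loaded
import time
import urllib.request

def wait_for_models(url: str = "http://localhost:8001/v1/models",
                    deadline_s: int = 120) -> bool:
    """Return True once the endpoint answers 200, False if the deadline passes."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass          # connection refused while the model is still loading
        time.sleep(5)
    return False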
What Comes Next
MCP Freemium layer: three tools (quality validator, affiliate-link checker, image-caption generator) behind Lightning L402 microtransactions. Mistral Small 4 for inference, no cloud. First tool targets mid-May. Architecture overview in From Blog to Agent Tools.
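None of this is built yet, but the protocol shape is fixed by the L402 spec: an unauthenticated call gets an HTTP 402 with a macaroon and a Lightning invoice, a paid call presents the preimage. A sketch with FastAPI, with all the Lightning plumbing stubbed out; none of these helpers exist in the current codebase:

# l402_gate.py: sketch of the paywall shape, not the shipped implementation
from fastapi import FastAPI, Header, Response

app = FastAPI()

def mint_macaroon() -> str:            # stub: real version binds a payment hash
    return "AgEEbHNhdA=="

def create_invoice(sats: int) -> str:  # stub: real version asks LND for a bolt11
    return "lnbc-stub-invoice"

def paid(auth_header: str) -> bool:    # stub: real version checks preimage vs macaroon
    return auth_header.startswith("L402 ")

@app.post("/tools/quality-validator")
def quality_validator(draft: dict, authorization: str | None = Header(default=None)):
    if authorization is None or not paid(authorization):
        challenge = f'L402 macaroon="{mint_macaroon()}", invoice="{create_invoice(100)}"'
        return Response(status_code=402, headers={"WWW-Authenticate": challenge})
    return {"status": "ok", "issues": []}   # placeholder for the real validator output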
Persistent AI memory: Mem0 + ChromaDB. The current workaround (VIBE.md, a 400-line markdown file) doesn’t scale. The plan is a proper memory layer so session context doesn’t have to be manually maintained across every Vibe CLI call. OpenClaw already has a workspace pattern (USER.md, HEARTBEAT.md, TOOLS.md) that goes partway here.
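A sketch of the ChromaDB half of that plan; the collection name, persistence path, and the remember/recall split are placeholders, and the Mem0 layer that would sit on top is omitted:

# memory_layer.py: sketch of the planned persistent memory, ChromaDB side only
import uuid
import chromadb

client = chromadb.PersistentClient(path="/data/projects/memory/chroma")
sessions = client.get_or_create_collection("vibe-sessions")

def remember(session_id: str, note: str) -> None:
    """Store one session note; Chroma embeds it with its default embedding model."""
    sessions.add(ids=[f"{session_id}-{uuid.uuid4().hex}"],
                 documents=[note],
                 metadatas=[{"session": session_id}])

def recall(query: str, n: int = 3) -> list[str]:
    """Pull the notes most relevant to the current prompt."""
    result = sessions.query(query_texts=[query], n_results=n)
    return result["documents"][0]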
Voice input: Whisper running locally on the Spark. The podcast pipeline already has Voxtral for output. Whisper closes the loop for voice-in, voice-out sovereign interaction.
Hetzner migration: blog hosting off Njalla. More stable IPv6, same privacy model.
Who This Is For
Not everyone will find this useful. Honest priority ordering:
Engineers who want to self-host. Already motivated, just need working configs. Highest conversion likelihood for MCP tools and affiliate links. The fixes and setup articles exist for them.
Privacy-first developers and Nostr builders. Aware of the dependency problem, looking for alternatives. Natural audience for V4V, Lightning, and the decentralization content.
Technical founders considering local AI. Thinking about cost, data control, and lock-in. The benchmark and strategy articles speak to them; consulting is the revenue model at this layer.
Curious generalists. Interested in the space, not necessarily builders. The podcast covers this entry point. Narrative first, configs later.
The first two groups convert immediately. The third is slower but higher value per interaction. The fourth is audience, not customer yet. But audience eventually becomes customer.
Measurable Goals
This is a hobby that has business potential. Treating it without checkpoints means it stays a hobby. Three checkpoints, relative metrics:
| Checkpoint | Target | Signal |
|---|---|---|
| Month 3 (Jul 2026) | First MCP tool live, 10 paying users via L402 | Real money, any amount |
| Month 6 (Oct 2026) | EUR 100/month recurring across streams | Which stream delivers: affiliate, V4V, or MCP |
| Month 12 (Apr 2027) | 3 tools live, podcast at 200 listeners/episode | Traction without paid acquisition |
Revenue streams in order of earliest expected return: Affiliate (live now), V4V (growing), MCP L402 (next), Consulting (later). If one stream outperforms for 90 days, double down on it. If one stays flat for 90 days, treat it as a data point.
Start Here: Reading Order
If you’re building something similar, this is the order that makes sense:
- Hardware and inference → Self-Host Mistral Small 4 with SGLang on DGX Spark
- Performance reality → SGLang on DGX Spark: Benchmarks
- The OOM problem → SGLang Restart OOM Fix
- Agent automation → Self-Hosted AI Coding Agent (OpenHands)
- Hybrid tool strategy → Six Weeks Running Mistral Small 4
- Content pipeline → Self-Hosted AI Content Pipeline
- Monetization → Alby Lightning Wallet Setup
- Where it’s going → From Blog to Agent Tools
Everything links back here. This article gets updated when the stack changes.