← All articles

#strategy

38 articles

Roadmaps, concepts, meta-reflections, and benchmarks across the sovereign AI stack.

Two Days From Localhost to Production: Building a Hybrid Sovereign AI Site

Two Days From Localhost to Production: Building a Hybrid Sovereign AI Site

A two-day build log from localhost to a sovereign hybrid AI site. Three failure modes, exact fixes, and the reproducibility checklist most cloud guides skip.

Read article →
gpt-oss-120b pulls nearly four million downloads a month, so I assumed it was a one-command experience. Getting it to serve on a DGX Spark took a frozen box, a 25GB image pull strangled by a Tor proxy, and a 43-minute kernel compile. Then the measurement: on my own coding tasks the 120B scored 56 percent where the 35B Qwen I already run scored 100. Here is the full teardown, with every number measured on the box and the failed measurements thrown out, not published.
dgx-sparkcomparisonbenchmarkingvllm

I Built OpenAI's gpt-oss-120b on a Single DGX Spark. My 35B Qwen Out-Coded It.

gpt-oss-120b pulls nearly four million downloads a month, so I assumed it was a one-command experience. Getting it to serve on a DGX Spark took a frozen box, a 25GB image pull strangled by a Tor proxy, and a 43-minute kernel compile. Then the measurement: on my own coding tasks the 120B scored 56 percent where the 35B Qwen I already run scored 100. Here is the full teardown, with every number measured on the box and the failed measurements thrown out, not published.

NVIDIA's Nemotron-3-Super-120B-A12B is tuned for Blackwell and ships an NVFP4 build that fits a single 128GB DGX Spark. I measured it where almost nobody else does: single-stream, on one GB10. The result is 23.7 tok/s, a competent but painfully verbose coder, and a genuinely strong retrieval agent. Here is the full teardown, with the published benchmarks fact-checked against what the box actually did.
dgx-sparkbenchmarkingmcpvllm

I Ran NVIDIA's 120B Nemotron on a Single DGX Spark. It Is Smart, Slow, and Surprisingly Good at One Job

NVIDIA's Nemotron-3-Super-120B-A12B is tuned for Blackwell and ships an NVFP4 build that fits a single 128GB DGX Spark. I measured it where almost nobody else does: single-stream, on one GB10. The result is 23.7 tok/s, a competent but painfully verbose coder, and a genuinely strong retrieval agent. Here is the full teardown, with the published benchmarks fact-checked against what the box actually did.

I built a small, dependency-free harness that answers one question with numbers instead of vibes: does this enhancement make my agent measurably better, on my models, on my tasks? Here is the method, what I found, and why deterministic gates are the whole point.
agentsopencodebenchmarking

Agent-bench: stop trusting install counts, start measuring your agent's tools

I built a small, dependency-free harness that answers one question with numbers instead of vibes: does this enhancement make my agent measurably better, on my models, on my tasks? Here is the method, what I found, and why deterministic gates are the whole point.

I run Qwen3.6-35B at 4.75-bit for coding. A 4.0-bit AutoRound build promised more speed. Fewer bits usually means a dumber model, so I measured both halves: decode throughput and coding quality, the latter through my own agent-bench harness. The result settled it. Here is the duel, the bandwidth math, and why the bit count was the wrong thing to fear.
qwendgx-sparkbenchmarking

Smaller, Faster, Still Smart? AutoRound int4 vs PrismaQuant for a Self-Hosted Coding Model

I run Qwen3.6-35B at 4.75-bit for coding. A 4.0-bit AutoRound build promised more speed. Fewer bits usually means a dumber model, so I measured both halves: decode throughput and coding quality, the latter through my own agent-bench harness. The result settled it. Here is the duel, the bandwidth math, and why the bit count was the wrong thing to fear.

caveman has ~200k installs and claims 75% token reduction. I measured it on two local models and three Claude frontiers (Sonnet 4.6, Opus 4.8, Fable 5). The math does not work out the way the claim says it does.
agentsbenchmarking

Caveman: does the 75% token-saving skill survive contact with a self-hosted model?

caveman has ~200k installs and claims 75% token reduction. I measured it on two local models and three Claude frontiers (Sonnet 4.6, Opus 4.8, Fable 5). The math does not work out the way the claim says it does.

Serena is one of the most-installed coding MCP servers. I tested it against two local models (Qwen3.6-35b and Mistral-Small-4) on three refactor tasks with deterministic gates. The short answer is more interesting than yes or no.
mcpagentsbenchmarking

Does Serena help a self-hosted coding model? I benchmarked it

Serena is one of the most-installed coding MCP servers. I tested it against two local models (Qwen3.6-35b and Mistral-Small-4) on three refactor tasks with deterministic gates. The short answer is more interesting than yes or no.

Monitoring one VPS with a Prometheus stack is like hiring a security team for a garden shed. I wrote a 315-line bash script instead: one SSH session, twelve checks, one morning notification. Here is the design, the honest comparison against the usual suspects, and why detect-and-alert beats auto-fix at this scale.
devops

vps-healthcheck: Twelve Daily Checks, One SSH Session, One Notification

Monitoring one VPS with a Prometheus stack is like hiring a security team for a garden shed. I wrote a 315-line bash script instead: one SSH session, twelve checks, one morning notification. Here is the design, the honest comparison against the usual suspects, and why detect-and-alert beats auto-fix at this scale.

Value-for-value as the monetization model for sovgrid. The architectural fact (the channel exists) versus the dollar volume (zero sats received as of the most recent ground-truth audit). The honest version of what V4V is and is not, six months in.
authoritylightning

Refusing the Subscription Trap: A Year of V4V Lessons

Value-for-value as the monetization model for sovgrid. The architectural fact (the channel exists) versus the dollar volume (zero sats received as of the most recent ground-truth audit). The honest version of what V4V is and is not, six months in.

About a third of the people who ask me end up not buying. Six specific 'don't buy' clauses, four buyer profiles, the four real alternatives ranked, a flowchart, and the operational receipts (drop_caches=3, VLLM_FLASHINFER_MOE_BACKEND=latency, 30-minute recovery runbook) the spec sheet does not give you. Lead-magnet source; also gated as PDF.
dgx-sparkfunnel

Should You Buy a DGX Spark in 2026? The Honest Decision Tree

About a third of the people who ask me end up not buying. Six specific 'don't buy' clauses, four buyer profiles, the four real alternatives ranked, a flowchart, and the operational receipts (drop_caches=3, VLLM_FLASHINFER_MOE_BACKEND=latency, 30-minute recovery runbook) the spec sheet does not give you. Lead-magnet source; also gated as PDF.

Two days of reverse-proxy work, a full Caddy stack with Let's Encrypt TLS and basic-auth in front of opencode web, all working. Then I realized I am not the right user for it. The actual mobile answer was already on my phone, and OpenWebUI quietly took over the other half of the use case.
opencodeagentsdevops

I Built a Web UI for Mobile Coding. Termux Won Anyway.

Two days of reverse-proxy work, a full Caddy stack with Let's Encrypt TLS and basic-auth in front of opencode web, all working. Then I realized I am not the right user for it. The actual mobile answer was already on my phone, and OpenWebUI quietly took over the other half of the use case.

HackerNoon ranks coding LLMs by programming language. WhatLLM.org aggregates LiveCodeBench, Terminal-Bench and SciCode. Neither tests self-hosted models on real hardware. A self-hoster's reading protocol for coding leaderboards.

Three Coding Leaderboards, Three Blind Spots: What HackerNoon and WhatLLM Don't Tell Self-Hosters

HackerNoon ranks coding LLMs by programming language. WhatLLM.org aggregates LiveCodeBench, Terminal-Bench and SciCode. Neither tests self-hosted models on real hardware. A self-hoster's reading protocol for coding leaderboards.

The planning post before the implementation post. FIPS is an open-source mesh protocol with cryptographic identity and transport-agnostic routing. My sovereign AI stack is sovereign at the model and the hardware, and leaks the whole workload at the network boundary. Here is why that gap matters, the five concrete pieces of work I am committing to, and why I write the plan in public before I know if it works.
nostragentsmcpdgx-spark

FIPS, the Mesh Protocol, and Why I Need to Build It to Believe It

The planning post before the implementation post. FIPS is an open-source mesh protocol with cryptographic identity and transport-agnostic routing. My sovereign AI stack is sovereign at the model and the hardware, and leaks the whole workload at the network boundary. Here is why that gap matters, the five concrete pieces of work I am committing to, and why I write the plan in public before I know if it works.

I almost published 'Mistral Small 4 scores 0/30 on coding, the quant kills it'. A competent model scoring exactly zero should have been the red flag. The benchmark harness was hanging behind this stack's Tor docker proxy and never reached the model. Here is the broken-ruler story, the direct measurement that replaced it, and every Mistral-vs-Qwen3.6 number at a glance, including which one can actually read an image.
qwenmistraldgx-sparkdevops

Mistral vs Qwen3.6 on DGX Spark: the 0/30 That Was a Broken Ruler

I almost published 'Mistral Small 4 scores 0/30 on coding, the quant kills it'. A competent model scoring exactly zero should have been the red flag. The benchmark harness was hanging behind this stack's Tor docker proxy and never reached the model. Here is the broken-ruler story, the direct measurement that replaced it, and every Mistral-vs-Qwen3.6 number at a glance, including which one can actually read an image.

This blog gates every article behind one Python scorer before it publishes. I gave Qwen3.6 and Mistral Small 4 the same brief, the Start Here hub article this site still owes, and ran the raw output through that real gate with no editing. Both passed. Both invented hardware, processes, and benchmarks the scorer counted as quality. Here is the full method, the two source texts, and why a passing score is a floor and not a truth filter.
qwenmistraldgx-sparkdevops

The Quality Gate That Rewards Fabrication: I Had Qwen and Mistral Write This Blog

This blog gates every article behind one Python scorer before it publishes. I gave Qwen3.6 and Mistral Small 4 the same brief, the Start Here hub article this site still owes, and ran the raw output through that real gate with no editing. Both passed. Both invented hardware, processes, and benchmarks the scorer counted as quality. Here is the full method, the two source texts, and why a passing score is a floor and not a truth filter.

Six traits I keep seeing across the people who fit the sovereign-engineer description: they argue with specs, name every dependency, default to publishing, plan in decade arcs while shipping weekly, price friction honestly, and gate their optimism. Written from the outside, by the operator who runs the iron the sovereign software eventually touches.
nostrdgx-spark

The Quiet Pattern Among Sovereign Engineers

Six traits I keep seeing across the people who fit the sovereign-engineer description: they argue with specs, name every dependency, default to publishing, plan in decade arcs while shipping weekly, price friction honestly, and gate their optimism. Written from the outside, by the operator who runs the iron the sovereign software eventually touches.

Eleven VibeVoice renders, one Voxtral baseline, the operator's ears. The first day of the three-day TTS spike that follows the V6=0/10 verdict. Engineering-log shape, with the actual audio embedded.
podcasttts

TTS Spike Day 1: VibeVoice Sample Matrix on DGX Spark

Eleven VibeVoice renders, one Voxtral baseline, the operator's ears. The first day of the three-day TTS spike that follows the V6=0/10 verdict. Engineering-log shape, with the actual audio embedded.

Five posts a week, no marketing department, no template-substitution. Building a Nostr distribution cadence for a self-hosted blog that does not embody what readers can spot in two scrolls.
nostr

How to Auto-Post on Nostr Without Reading Like a Bot

Five posts a week, no marketing department, no template-substitution. Building a Nostr distribution cadence for a self-hosted blog that does not embody what readers can spot in two scrolls.

Eight engineering fixes deep, three weeks of patches, two failure modes on the same engine. The Voxtral open checkpoint has no path to release-quality podcast audio. The drama of staying with it anyway, and the three engines I plan to spike next.
podcastttsvoxtral

Voxtral Capped at 3/10: Picking the Next Open TTS

Eight engineering fixes deep, three weeks of patches, two failure modes on the same engine. The Voxtral open checkpoint has no path to release-quality podcast audio. The drama of staying with it anyway, and the three engines I plan to spike next.

My MCP-server NSM page showed 334 unique agents. One change to the aggregator (User-Agent plus IP /24 dedupe) and the truth surfaced: 86% of external hits come from a single /24 range, the rest are mostly automated probes. Headline metrics that look like reach can be five services pretending to be many.
mcp

Why 334 Unique IPs Was Really 5 Services in Trench Coats

My MCP-server NSM page showed 334 unique agents. One change to the aggregator (User-Agent plus IP /24 dedupe) and the truth surfaced: 86% of external hits come from a single /24 range, the rest are mostly automated probes. Headline metrics that look like reach can be five services pretending to be many.

Each number on the live Insights page has a formula, a business meaning, and a vanity-trap. If you are running a DGX Spark as the engine of a small AI service, here is how to read the dashboard daily without chasing growth-theatre, and which two metrics are the only ones worth waking up to check.
mcpdevops

How to Read the Insights Dashboard for a DGX-Spark Business, Not a Hobby Blog

Each number on the live Insights page has a formula, a business meaning, and a vanity-trap. If you are running a DGX Spark as the engine of a small AI service, here is how to read the dashboard daily without chasing growth-theatre, and which two metrics are the only ones worth waking up to check.

Qwen3.6-35B-A3B PrismaQuant at 95 tok/s on a single Spark (Spark Arena rank 4) beats my measured Mistral Small 4 at 35 tok/s by 2.7x on paper. This is the plan, not the result. SWE-Bench scores, opencode replacing vibe, why Mistral stays installed for creative prose, the Hacker News critiques on opencode I take seriously, and the two-day prep before day-2 measurements land on 2026-05-25.
dgx-spark

Spark Arena Rank 4 Made Me Add Qwen3.6 to My DGX Spark

Qwen3.6-35B-A3B PrismaQuant at 95 tok/s on a single Spark (Spark Arena rank 4) beats my measured Mistral Small 4 at 35 tok/s by 2.7x on paper. This is the plan, not the result. SWE-Bench scores, opencode replacing vibe, why Mistral stays installed for creative prose, the Hacker News critiques on opencode I take seriously, and the two-day prep before day-2 measurements land on 2026-05-25.

Rendering a 367-character podcast turn as one Voxtral call takes 21 seconds. Split into 90-character chunks: 35 seconds. Same words, same voice, 38 percent more wallclock.
devopspodcastttsvoxtral

Voxtral Chunk Strategy: 38 Percent Faster Render with Whole Turns

Rendering a 367-character podcast turn as one Voxtral call takes 21 seconds. Split into 90-character chunks: 35 seconds. Same words, same voice, 38 percent more wallclock.

Wired a browser search form directly into an MCP tool that AI agents already call. One afternoon, four endpoints, zero CORS, real numbers from the deploy. The mistakes that cost me an hour are documented inline.
mcp

I Gave My Blog a Search Box, and It Runs Through My Own MCP Server

Wired a browser search form directly into an MCP tool that AI agents already call. One afternoon, four endpoints, zero CORS, real numbers from the deploy. The mistakes that cost me an hour are documented inline.

What a DGX Spark actually draws from the wall, what that costs in Germany versus the US, how it compares to a lightbulb and a Bitcoin miner, and how many solar panels would offset it. With sources.
dgx-spark

How Much Electricity Does Self-Hosted AI Actually Use? Lightbulbs, Bitcoin Miners, and Solar Panels

What a DGX Spark actually draws from the wall, what that costs in Germany versus the US, how it compares to a lightbulb and a Bitcoin miner, and how many solar panels would offset it. With sources.

Cipherfox and Hexabella post curated content without human oversight, using Mistral Small 4 on a DGX Spark and a hardened signing service. Here’s how it works today.
mistralnostr

How Two Sovereign AI Personas Run Your Blog and Nostr Feed

Cipherfox and Hexabella post curated content without human oversight, using Mistral Small 4 on a DGX Spark and a hardened signing service. Here’s how it works today.

How sovgrid.org structures its most important posts to guide readers and shape the blog’s identity.
mistral

Hub Articles Protocol: How Three Reading-Paths Earn Their Homepage Slot

How sovgrid.org structures its most important posts to guide readers and shape the blog’s identity.

How we’re getting the Sovereign AI MCP endpoint listed in five registries with real traffic tracking and zero KYC friction.
mcpsglang

MCP Registry Distribution: Submission Plan & Tracking

How we’re getting the Sovereign AI MCP endpoint listed in five registries with real traffic tracking and zero KYC friction.

A no-BS breakdown of the gaps in a self-hosted AI stack and the exact next steps to plug them.
giteamistralopenclaw

OpenClaw: What’s Still Missing for Full Usability

A no-BS breakdown of the gaps in a self-hosted AI stack and the exact next steps to plug them.

Three Nostr identities, a working zap-attribution pipeline, 44 articles live at the time of writing, and after 30 days exactly zero zaps. What I learned about V4V on a small technical blog.
lightningnostr

Building Per-Article Zap Tracking on Nostr, and Then Getting Zero Zaps

Three Nostr identities, a working zap-attribution pipeline, 44 articles live at the time of writing, and after 30 days exactly zero zaps. What I learned about V4V on a small technical blog.

Mainstream AI coverage cites only one leaderboard. arena.ai ranks quality. spark-arena.com ranks throughput on real hardware. The decision that matters lives in the third column nobody publishes.
mistral

Two Leaderboards Nobody Reads Together: Why arena.ai Doesn't Tell You About Self-Hosted AI

Mainstream AI coverage cites only one leaderboard. arena.ai ranks quality. spark-arena.com ranks throughput on real hardware. The decision that matters lives in the third column nobody publishes.

Status snapshot of what is running on this stack today and what is being built next. For returning readers. New here? Read 'Self-Hosted AI: Start Here' first.
mcpnostrpodcast

Sovereign AI Grid: What's Working and What Comes Next

Status snapshot of what is running on this stack today and what is being built next. For returning readers. New here? Read 'Self-Hosted AI: Start Here' first.

This technical blog maintains a single source of truth while layering machine-readable tools on top, ensuring both human readers and AI agents get accurate, up-to-date information.
mcpsglang

A Self-Hosted AI Blog That Serves Both Humans and Machines

This technical blog maintains a single source of truth while layering machine-readable tools on top, ensuring both human readers and AI agents get accurate, up-to-date information.

Learn how to transform your technical blog into a dual-purpose knowledge base that serves both human readers and AI agents while future-proofing your content strategy.
mcpmistralsglang

From Blog to Agent Tools: How One Knowledge Base Powers Both Humans and AI

Learn how to transform your technical blog into a dual-purpose knowledge base that serves both human readers and AI agents while future-proofing your content strategy.

A deep dive into the DGX Spark ecosystem, real power costs, and agent-driven tool adoption for self-hosting 119B models at home in 2026.
mcpmistral

Running a 119B AI Model at Home: Who Actually Does This in 2026

A deep dive into the DGX Spark ecosystem, real power costs, and agent-driven tool adoption for self-hosting 119B models at home in 2026.

A hands-on comparison of AI coding tools testing local inference vs cloud dependency for privacy-first workflows.
mistralvibe

Hands-on AI Coding Tools: Why I Kept Claude Code + Vibe and Dumped Cursor and Continue.dev

A hands-on comparison of AI coding tools testing local inference vs cloud dependency for privacy-first workflows.

A deep dive into optimizing Mistral Small 4 for local technical blogging, with practical solutions for session memory, image generation, and EEAT compliance.
mistralvibe

Six Weeks Running Mistral Small 4 as a Production Tool: What I Actually Learned

A deep dive into optimizing Mistral Small 4 for local technical blogging, with practical solutions for session memory, image generation, and EEAT compliance.

A full-system review of our quality scoring pipeline against a rigorous philosophical framework. Three things it confirms, two things it exposes, and one concrete fix that changes the architecture.

Content Quality in the AI Age: Where Our Scoring System Is Right, Wrong, and Missing

A full-system review of our quality scoring pipeline against a rigorous philosophical framework. Three things it confirms, two things it exposes, and one concrete fix that changes the architecture.