HUB Self-Hosted AI: Start Here

May 3, 2026 19 min read

Quick Take

This is for someone taking a local LLM stack seriously. Not “Ollama on my MacBook for fun”. The stack I run daily on hardware I bought, against a model I control.

The first decision is hardware, and it constrains everything. The second is the inference engine, which is harder to migrate from than to choose. The third is the agent client, which I swap freely.

The hardest parts are not the install steps. They are the silent failures, the toolchain gaps on bleeding-edge silicon, and the discipline of keeping one thing running while everything else changes.

The reading path at the end is the order I would pick up the rest of the stack if I started over. Bookmark this. Come back.

I started this stack in late 2025 because I had stopped wanting to negotiate with the cloud about what I was allowed to ask my own AI. Eight months and one DGX Spark later, I have a working answer. This article is the post I wish I had read before I pulled out the credit card the first time.

It is also the article that makes my opinions explicit so you can disagree on purpose. I will tell you what I picked, why, what I would do differently with hindsight, and where I am still not sure. Where my choice is the right default for most people I will say so. Where it is just my preference I will flag it.

On this page:

If you got here from a Bitcoin context
Who this is for
Hardware: the decision tree that actually matters
Inference engine: the choice that is hardest to migrate from
Minimum-viable deploy: what actually needs to be running
What hurts the most after you start
The privacy floor: why no-KYC matters operationally
Reading path: what to read next
What I actually use

If you got here from a Bitcoin context

If you self-custody your Bitcoin, you already understand the mental model this stack applies to AI. Same playbook, different layer:

“Not your keys, not your coins” becomes “not your weights, not your AI”. Inference on a model someone else hosts is the API equivalent of leaving your sats on an exchange. It works until policy or pricing or the provider itself changes. Then it does not.
Cold storage to remove trust dependencies maps to local inference to remove prompt-and-output dependencies. Same discipline, different layer.
Lightning as a sovereign payment rail maps to MCP as a sovereign tool-call rail. Both are open protocols you can run yourself. Both have hosted gateways for convenience. Neither requires you to trust the gateway operator long-term.
No-KYC providers chosen because the regulatory surface stays simple is the exact same reasoning at the wallet, hardware-wallet, and VPS layers. Alby ^↗, BitBox02 ^↗, and FlokiNET ^↗ are the three I use here. Same logic that picked your hardware wallet picked my VPS.

The hardware-decision tree below and the inference-engine choice that follows are the operational details of one specific implementation. The sovereignty argument behind them is the same argument you already made about your money. This is the AI-layer version of the same discipline.

If that frame lands for you, the rest of this article is the operational how-to for people who already get the why. Skip “Who this is for” if you want; you are already in.

If you came to Bitcoin self-custody through tutorials, BTCSessions is the channel that walked me through the wallet-and-Lightning side of the same discipline before any of this AI stack existed. The mental model carries over almost verbatim. The Sovereign Sessions channel covers the broader sovereignty stack with the same audience-relationship and is the outbound benchmark this blog is being calibrated against.

Who this is for

This is not for “I want to try AI”. For that, install Claude Desktop or Cursor and skip this whole post. The cloud experience is genuinely good, the latency is fine, the per-token cost is manageable for occasional use, and you do not need a homelab to use AI productively in 2026. I still use Claude Code daily for serious coding work even with my local stack running. The two are not in opposition.

This is for the person who has decided one of these three is true:

Privacy matters operationally, not philosophically. The data I would feed an LLM is too sensitive to send to a cloud API even with the strongest contractual guarantees. (My case.)
Cost predictability matters. Per-token pricing on cloud APIs scales superlinearly with serious agentic-AI use. A fixed-cost local box pays back fast at high volume.
Stack-control matters. I want to build something that depends on a model I control, not one an upstream provider can deprecate, change pricing on, or filter outputs from at any time.

If none of those is true for you, do not self-host. The cloud option is fine.

If at least one is true, the rest of this is for you.

Hardware: the decision tree that actually matters

There is no single right answer for hardware in mid-2026. There are four credible paths. I picked Path 1. If you pick differently, do it on purpose.

Path 1: NVIDIA DGX Spark (GB10, 128 GB unified memory)

What I picked. The case for: unified memory means a 119B MoE model fits in the inference workspace without sharding tricks. ARM64 throughout. Compact desk-side form factor. The best single-machine configuration available at consumer-adjacent pricing in 2026 for serious-but-solo workloads.

The case against: bleeding-edge silicon. SM 12.1 (Blackwell-class) means the toolchain is still catching up. PyTorch SM 12.1 support is officially incomplete. The flashinfer attention backend OOMs on the first batch. I live on SGLang nightly builds for the next several months. If “the latest toolchain doc may not match what you actually need” is the kind of thing that ruins your week, this is not your hardware.

Real numbers from my stack: 35 to 41 tok/s single-stream on Mistral Small 4 119B with EAGLE speculative decoding, 12 to 15 tok/s without. About 90 to 110 W under load, 35 W idle. Power cost over a month is roughly the cost of a small space heater on a timer.

→ Pick if you can tolerate bleeding-edge toolchain quirks for the unified-memory plus ARM64 plus desk-form-factor combination that nothing else matches.

Path 2: Mac Studio M3 Ultra (192 GB unified memory)

I did not pick this. The case for: unified memory at higher capacity than DGX Spark, mature Apple Silicon toolchain, no driver lottery, no cooling drama in a home office. macOS hosts run llama.cpp and MLX cleanly without Linux container ceremony.

The case against: not as fast as a real GPU per token. The MLX framework is improving fast but does not match SGLang’s batching maturity. Apple Silicon is not designed for long-running 24/7 inference workloads in the same way a GPU server is. I went with DGX because I wanted the agent-stack maturity, not because Apple Silicon is a bad choice.

Peer-reported numbers (I have not measured): roughly 18 to 25 tok/s on a 70B-class model at int4 quantization. Less on a true 100B+ model. Idle power is excellent. Sustained-load power is comparable to mid-range NVIDIA gear.

→ Pick if you want a quiet, low-maintenance, single-machine setup and you are comfortable with Apple-Silicon-specific tooling tradeoffs.

Path 3: Custom rig with used 3090s or 4090s

I did not pick this either. The case for: highest performance-per-dollar in the GPU class. Mature CUDA toolchain. Standard PC form factor means standard cooling, standard PSU, standard troubleshooting. Two 3090s in NVLink land you in 48 GB VRAM territory for roughly €1,700-2,100 of used GPUs at mid-2026 EU prices (single-card range €840-1,050 on eBay and EU price aggregators, verify against current listings before buying).

The case against: not unified memory. Sharding a 119B MoE model across two cards is doable but adds complexity. Power draw is significant (700 to 900 W under load with two cards). Fans are loud. Form factor is desktop-tower, not appliance.

Peer-reported numbers: a dual-3090 NVLink rig runs 70B-class models at 25 to 40 tok/s with vLLM. Varies hugely with batch size and quantization. Power cost is real (€40 to €80 per month at typical EU grid prices for daily-driver use).

→ Pick if you are comfortable with PC building, you want maximum bang per dollar, and you can absorb the form factor and noise tradeoffs.

Path 4: Cloud-rented bare-metal GPU

I did this for two weeks before buying the DGX. The case for: zero capital expenditure. Burst-rentable for one-off heavy jobs. Lets you test what hardware would actually make sense before spending real money.

The case against: per-hour cost adds up fast for daily-driver use. Latency to the cloud GPU is real (30 to 100 ms typical). The whole point of “self-hosted” is contradicted if your weights live on someone else’s machine.

→ Pick if you have not yet decided which of the above three is right for you and you want a few weeks of real testing before buying. RunPod and Lambda Labs offer hourly bare-metal A100s and H100s at reasonable prices. In my experience this short-circuits into one of the other three within a month.

Inference engine: the choice that is hardest to migrate from

Once the hardware is chosen, the inference engine choice locks in the rest of your stack for at least 6 to 12 months. Migrating between engines is not a quick swap. The model snapshots, the launch flags, the tokenizer integration, and the agent-side OpenAI-compatibility layer all subtly differ. Pick deliberately.

SGLang

What I picked. The case for: best speculative-decoding integration (EAGLE, EAGLE-2, MTP). Mature batching. Good ARM64 nightly support. Active development with frequent fixes for new silicon. The right default for serious-throughput single-machine setups today, in my opinion.

The case against: nightly-build dependency on bleeding-edge hardware. The stable releases lag GB10 support by months. Some attention backends (flashinfer) do not work on every architecture and you have to know which to pick (--attention-backend triton for GB10).

→ Read next: SGLang on DGX Spark: 35-41 tok/s with EAGLE Speculative Decoding for the production config that works for me.

vLLM

The case for: most mature batching library. Best community support across model architectures. The closest thing to a default in 2026 for general-purpose LLM serving.

The case against: speculative decoding integration lags SGLang. Memory accounting on bleeding-edge silicon is less stable. The vllm-omni branch handles multi-modal but has its own quirks (see the Voxtral Stage 1 OOM article for one example I hit).

→ Read next: Voxtral Stage 1 OOM on GB10: Why —enforce-eager Is Not Enough for one concrete vLLM-omni gotcha.

llama.cpp

The case for: runs everywhere, including CPU-only setups. Tiny memory footprint. Excellent quantization options. The right default for “I want to run a small model on a Mac or Linux laptop without a GPU”.

The case against: not optimized for the multi-user batched inference an MCP-serving agent stack needs. Performance-per-watt is good. Performance-per-second on a single-user-streaming workload trails SGLang and vLLM by a meaningful margin on the same hardware.

→ Pick if you are on Mac or Linux laptop hardware, or your use case is occasional inference rather than daily-driver agent work.

Minimum-viable deploy: what actually needs to be running

Once hardware and inference engine are chosen, the minimum-viable agent-ready deployment has five components. Each can fail independently. Each needs its own startup-restart-monitor story. I have hit all five failure modes in the first six months.

1. Inference server (always-on)

SGLang, vLLM, or llama.cpp serving an OpenAI-compatible API on localhost:30000 (or wherever you put it). Wrapped in systemd or Docker so it restarts on host reboot. Load-test it once with a curl POST to confirm it speaks the OpenAI chat-completions format that everything else expects.

2. Agent client (one or more)

The agents that actually call the inference server. The honest list of what works in 2026, with how I use each one:

Claude Code (CLI) is cloud, my daily driver for serious coding work even with my self-hosted stack running. Pay-as-you-go. Best at architecture and large-context reasoning.
OpenClaw is local persona orchestration with a Side-Car-Proxy that fixes the Mistral alternating-roles BadRequestError. The right tool for multi-persona blog or Matrix bot work. I run cipherfox plus hexabella through this.
Vibe (Mistral CLI) is local CLI for privacy-sensitive single-task work. Mistral’s own CLI, MCP-aware, Python 3.12+. I reach for this when nothing should leave the network.
OpenHands is Docker-based coding agent with Mistral via SGLang. Good for sandboxed multi-step work. I keep it around for end-to-end stack validation, not as the daily driver.

3. MCP server (optional, recommended at scale)

Once you have a knowledge base worth referencing (technical blog, internal docs, runbooks), a Model Context Protocol server makes that knowledge agent-callable. The pattern: an agent asks search_blog("flashinfer OOM on GB10"), gets back ranked excerpts plus operational fixes, instead of hallucinating a generic answer.

I shipped one for this blog. I also shipped an article saying it is mostly redundant at the current corpus size and explaining when it stops being redundant: The Sovereign AI Blog MCP Is Mostly Redundant Today, And That Will Change. Read that before you build your own.

4. Search and web-fetch layer

Most agents need web search at some point. Options:

Self-hosted SearXNG in Docker on the same host. Privacy-respecting metasearch, free, no telemetry to third parties. What I run.
Cloud search API if your privacy floor permits it. I do not use cloud search because it defeats the point of the rest of the stack.
Per-call web fetch for known URLs without going through a search index.

5. Public edge (if you want anyone to reach the MCP)

If your MCP server should be reachable from cloud agents (Claude Code, Smithery, Glama gateways), it needs a public TLS endpoint. The cleanest pattern is a small no-KYC privacy VPS with Caddy reverse-proxy and Let’s Encrypt, with the MCP server reachable over Streamable HTTP. The DGX Spark stays at home and serves inference. The public VPS terminates TLS and proxies the MCP-tool calls. See Floki-VPS Setup for Sovereign AI Workloads for the Caddy config that does this.

What hurts the most after you start

Honest list, in approximate order of how often each one bit me in the first three months. None of these are in the install guides because none of them are install problems.

1. Toolchain gaps on bleeding-edge silicon

If you bought DGX Spark or any other recent NVIDIA architecture, the upstream tooling assumes you have older silicon. PyTorch official wheels do not yet support SM 12.1. Nightly builds do but are unstable. Flashinfer attention backend has no SM 12.1 kernels and OOMs silently. My workaround: use --attention-backend triton for SGLang, accept that some throughput is left on the table until the official toolchain catches up. See SGLang on DGX Spark: 35-41 tok/s with EAGLE Speculative Decoding for the working configuration.

2. Sequential-only GPU services

SGLang, Voxtral (TTS), ComfyUI (image gen), and any other GPU-bound service cannot share the unified memory pool meaningfully. They have to take turns. The orchestration cost (which one is running, who restarts whom, how does the dashboard show current state) is non-trivial. Plan for this from day one rather than discovering it the first time you try to generate a hero image while inference is busy. I did not plan for it. I now run a dashboard that does the dance for me.

3. The first OOM

My first OOM was confusing because the failure surface and the root cause were in different processes. The CUDA OOM error appeared in a worker subprocess, but the flag that should have prevented it was passed to the parent process and never inherited. See Voxtral Stage 1 OOM on GB10: Why —enforce-eager Is Not Enough for the canonical example. The general lesson: when an OOM appears on hardware that should have plenty of memory, check whether the parent-process flags actually propagated to the child.

4. Mistral alternating-roles BadRequestError

If you point any agent framework at SGLang serving Mistral, you will hit a BadRequestError saying something like “conversation roles must alternate”. This is the canonical Mistral-on-strict-OpenAI-spec failure. Three different agents have three different fixes for the same root cause:

OpenHands: set enable_prompt_extensions = false in config.toml. See OpenHands Setup with Mistral-via-SGLang: The Multi-Arch Container Recipe.
OpenClaw: install the Side-Car-Proxy that rewrites incoming requests before SGLang sees them. See Fix OpenClaw + SGLang with Mistral: Stop the “conversation roles must alternate” 400 BadRequest.
Vibe: use --no-pretty and similar flags to keep prompt structure simpler. See Vibe 400 Bad Request Fix: Mistral Alternating Roles and reasoning_effort.

Same root cause, three different framework-side fixes. Knowing this in advance saves hours.

5. Disk fills up

Model weights, container images, build caches, and downloaded artifacts add up fast. Set up a daily disk-check probe before you actually fill the disk, not after. I learned this one the hard way after /data hit 95 percent on a Sunday morning. See Three Silent Failures That Would Have Killed My Self-Hosted AI Stack for the daily-check pattern that catches it now.

6. Backup discipline

Self-hosted means you are the backup operator. The minimum-viable backup is the secrets directory (Nostr keys, wallet seeds, SSH keys) on encrypted offline storage, plus the source-controlled stuff in Gitea. Anything less is asking for the bad day you have not had yet. See Backup System Rebuilt from Scratch: The Night I Found Out Six Months of Backups Were Fake for the discipline that earns its keep, with a name that tells you why I rewrote it.

The privacy floor: why no-KYC matters operationally

The Sovereign AI label gets used as marketing language a lot. On my stack it means three concrete things, each chosen for an operational reason rather than a philosophical one.

No-KYC for monetary infrastructure. The Lightning wallet (Alby ^↗), the hardware wallet for cold storage (BitBox02 ^↗), and the VPS provider (FlokiNET ^↗) all do not require KYC. The reason is not paranoia. It is that KYC creates ongoing tax and regulatory obligations on the provider side that change over time, and the cost of migrating off a KYC’d provider after a policy change is high. No-KYC providers stay simple.

No-cloud for inference. The model runs on hardware I own. No upstream provider can deprecate the model, change its outputs, or filter what I can ask it. If the provider goes out of business, my stack still works. I have lived through three “the API you depend on is being deprecated next quarter” emails in my career. Never again, at least for this layer.

No-tracking for readers. The blog you are reading right now ships zero JavaScript pixels. No Google Analytics, no Cloudflare Insights, no Discourse, no Disqus. The only signal source is Caddy access logs aggregated nightly. Readers get pages. The blog gets aggregate counts. Neither side has to consent to being tracked because there is nothing to consent to.

These three together make up my operational privacy floor. They do not make this stack perfect. They make it credibly different from the cloud-AI default in ways that compound over years rather than degrade over years.

Reading path: what to read next

You are at the entry point of an engineering log. Here is the order I would pick up the rest of the stack if I were starting over.

If you are still deciding hardware: stay on this article and the SGLang setup post until the hardware question is settled. Buying the wrong hardware is the most expensive mistake on this path.

Strategy zoom-out (read after the operational stuff)

Sovereign AI Grid: What’s Working and What Comes Next The status snapshot of what is currently running plus the roadmap of what is being built next. Companion piece to this one: this article is the entry point, that one is the state-of-the-stack you check back on.

What I actually use

Hardware: NVIDIA DGX Spark (GB10, 128 GB unified). Daily driver, paid for in full, would buy again.

Inference: SGLang nightly-dev-cu13 with --attention-backend triton --enforce-eager. Mistral Small 4 119B at NVFP4 quantization. EAGLE speculative decoding enabled.

Agent clients: Claude Code (cloud) for architecture and polish. OpenClaw (local) for persona work. Vibe (local) for privacy-sensitive single tasks. OpenHands kept around for end-to-end stack validation.

Public edge: FlokiNET ^↗ no-KYC VPS, Caddy with Let’s Encrypt, sovereign-mcp on mcp.sovgrid.org/self-hosted-ai.

Lightning wallet: Alby ^↗ for V4V tipping. BitBox02 ^↗ for cold storage. All no-KYC.

Daily-check probe: systemd timer at 06:00 that fires only if something is anomalous. The journal stays quiet on a normal day, which is when I know everything is fine.

This stack runs daily. It is not the only viable shape. It is one shape that works, documented honestly enough that the next person can decide whether to copy it, adapt it, or take the lessons and pick something different.

When you have made it through the reading path above, you will know which of those three you want to do.

Stack

Self-Hosted AI Decision Stack

Hardware, inference, agent client, what hurts after you start

Privacy floor no-KYC, no-cloud, no-tracking

Discovery MCP server (Streamable HTTP) over Caddy

Agent client Claude Code / OpenClaw / Vibe / OpenHands

Model Mistral Small 4 119B MoE (NVFP4) by default

Inference SGLang / vLLM / llama.cpp

Hardware DGX Spark / Mac M3 / used 3090s / cloud-rented