The honest decision tree for taking a local LLM stack seriously. Hardware tradeoffs I actually made, the inference engine I picked and why, the parts that hurt the most after I started, and the reading path through the rest of this blog.

HUB Self-Hosted AI: Start Here

Quick Take

  • This is for someone taking a local LLM stack seriously. Not “Ollama on my MacBook for fun”. The stack I run daily on hardware I bought, against a model I control.
  • The first decision is hardware, and it constrains everything. The second is the inference engine, which is harder to migrate from than to choose. The third is the agent client, which I swap freely.
  • The hardest parts are not the install steps. They are the silent failures, the toolchain gaps on bleeding-edge silicon, and the discipline of keeping one thing running while everything else changes.
  • The reading path at the end is the order I would pick up the rest of the stack if I started over. Bookmark this. Come back.

Update (2026-06-19). The production primary is now Qwen 3.6 AutoRound int4-mixed, switched from PrismaQuant on 2026-06-11 (69.2 tok/s on the canonical ruler, 12.7 percent better on the coding gate than the now-retired PrismaQuant build, whose weights are deleted). PrismaQuant figures further down are kept as the engineering-log record. The live model, quant, and throughput are on /stack/; the switch is measured in AutoRound int4 vs PrismaQuant.

I started this stack in early April 2026 because I had stopped wanting to negotiate with the cloud about what I was allowed to ask my own AI. Six weeks and one DGX Spark later, I have a working answer. This article is the post I wish I had read before I pulled out the credit card the first time.

It is also the article that makes my opinions explicit so you can disagree on purpose. I will tell you what I picked, why, what I would do differently with hindsight, and where I am still not sure. Where my choice is the right default for most people I will say so. Where it is just my preference I will flag it.

On this page:

If you got here from a Bitcoin context

If you self-custody your Bitcoin, you already understand the mental model this stack applies to AI. Same playbook, different layer:

The hardware-decision tree below and the inference-engine choice that follows are the operational details of one specific implementation. The sovereignty argument behind them is the same argument you already made about your money. This is the AI-layer version of the same discipline.

If that frame lands for you, the rest of this article is the operational how-to for people who already get the why. Skip “Who this is for” if you want; you are already in.

If you came to Bitcoin self-custody through tutorials, BTCSessions is the channel that walked me through the wallet-and-Lightning side of the same discipline before any of this AI stack existed. The mental model carries over almost verbatim. The Sovereign Sessions channel covers the broader sovereignty stack with the same audience-relationship and is the outbound benchmark this blog is being calibrated against.

Who this is for

This is not for “I want to try AI”. For that, install Claude Desktop or Cursor and skip this whole post. The cloud experience is genuinely good, the latency is fine, the per-token cost is manageable for occasional use, and you do not need a homelab to use AI productively in 2026. I still use Claude Code daily for serious coding work even with my local stack running. The two are not in opposition.

This is for the person who has decided one of these three is true:

  1. Privacy matters operationally, not philosophically. The data I would feed an LLM is too sensitive to send to a cloud API even with the strongest contractual guarantees. (My case.)
  2. Cost predictability matters. Per-token pricing on cloud APIs scales superlinearly with serious agentic-AI use. A fixed-cost local box pays back fast at high volume.
  3. Stack-control matters. I want to build something that depends on a model I control, not one an upstream provider can deprecate, change pricing on, or filter outputs from at any time.

If none of those is true for you, do not self-host. The cloud option is fine.

If at least one is true, the rest of this is for you.

Hardware: the decision tree that actually matters

There is no single right answer for hardware in mid-2026. There are five credible paths. I picked Path 1. If you pick differently, do it on purpose.

The five paths at a glance, before the detail below:

PathMemory and classThroughputPower and costToolchainPick if
1. DGX Spark GB10128 GB unified, fits 119B MoE~60 tok/s single-stream90-110 W load, 35 W idlebleeding-edge, still catching upyou want unified + ARM64 + desk form and tolerate quirks
2. Mac Studio M3 Ultra192 GB unified~18-25 tok/s on 70B int4 (peer)excellent idlemature Apple Silicon (MLX)quiet, low-maintenance, fine with Apple tooling
3. Dual 3090/4090 rig48 GB pooled (3090 NVLink), not unified25-40 tok/s on 70B (peer)700-900 W, €40-80/momature CUDAmax bang per dollar, fine with a PC build and noise
4. Cloud bare-metal GPUrented A100/H100per-hour, variesno capex, hourly adds upmatureyou want weeks of real testing before buying
5. Gaming laptop (RTX 5080 Mobile)16 GB VRAM, 8B-14B classcomfortable 8B-14B on Ollamalaptop, throttles under loadBlackwell-on-Linux lotteryportable single-device for someone starting smaller

Path 1: NVIDIA DGX Spark (GB10, 128 GB unified memory)

What I picked. The case for: unified memory means a 119B MoE model fits in the inference workspace without sharding tricks. ARM64 throughout. Compact desk-side form factor. The best single-machine configuration available at consumer-adjacent pricing in 2026 for serious-but-solo workloads.

The case against: bleeding-edge silicon. SM 12.1 (Blackwell-class) means the toolchain is still catching up. PyTorch SM 12.1 support is officially incomplete. The flashinfer attention backend OOMs on the first batch. I run vLLM with Qwen 3.6 as the daily driver and keep SGLang with Mistral as a fallback, both on nightly-adjacent toolchains for the next several months. (Update 2026-06-11: the Qwen daily driver itself reads images once the --language-model-only flag is dropped, so vision is no longer Mistral-only on this stack, see gpt-oss vs Qwen on a single Spark.) If “the latest toolchain doc may not match what you actually need” is the kind of thing that ruins your week, this is not your hardware.

Real numbers from my stack: roughly 60 tok/s single-stream on the daily-driver model with speculative decoding. About 90 to 110 W under load, 35 W idle. Power cost over a month is roughly the cost of a small space heater on a timer. The exact current model, engine, and throughput live on /stack/, because that software half changes monthly while the hardware does not.

→ Pick if you can tolerate bleeding-edge toolchain quirks for the unified-memory plus ARM64 plus desk-form-factor combination that nothing else matches.

→ Read next: Self-Host Mistral Small 4 with SGLang on NVIDIA DGX Spark (GB10): What Actually Works, Floki-VPS Setup for Sovereign AI Workloads for the public-edge counterpart.

Path 2: Mac Studio M3 Ultra (192 GB unified memory)

I did not pick this. The case for: unified memory at higher capacity than DGX Spark, mature Apple Silicon toolchain, no driver lottery, no cooling drama in a home office. macOS hosts run llama.cpp and MLX cleanly without Linux container ceremony.

The case against: not as fast as a real GPU per token. The MLX framework is improving fast but does not match SGLang’s batching maturity. Apple Silicon is not designed for long-running 24/7 inference workloads in the same way a GPU server is. I went with DGX because I wanted the agent-stack maturity, not because Apple Silicon is a bad choice.

Peer-reported numbers (I have not measured): roughly 18 to 25 tok/s on a 70B-class model at int4 quantization. Less on a true 100B+ model. Idle power is excellent. Sustained-load power is comparable to mid-range NVIDIA gear.

→ Pick if you want a quiet, low-maintenance, single-machine setup and you are comfortable with Apple-Silicon-specific tooling tradeoffs.

Path 3: Custom rig with used 3090s or 4090s

I did not pick this either. The case for: highest performance-per-dollar in the GPU class. Mature CUDA toolchain. Standard PC form factor means standard cooling, standard PSU, standard troubleshooting. Two 3090s with NVLink bridge land you in 48 GB pooled-VRAM territory for roughly €1,700-2,100 of used GPUs at mid-2026 EU prices (single-card range €840-1,050 on eBay and EU price aggregators, verify against current listings before buying). Two 4090s give you 24 GB each but no NVLink at all (NVIDIA dropped it on the 4000-series), so each card is its own device and you shard the model across PCIe instead of pooling memory.

The case against: not unified memory. Sharding a 119B MoE model across two cards is doable but adds complexity. Power draw is significant (700 to 900 W under load with two cards). Fans are loud. Form factor is desktop-tower, not appliance.

Peer-reported numbers: a dual-3090 NVLink rig runs 70B-class models at 25 to 40 tok/s with vLLM. Varies hugely with batch size and quantization. Power cost is real (€40 to €80 per month at typical EU grid prices for daily-driver use).

→ Pick if you are comfortable with PC building, you want maximum bang per dollar, and you can absorb the form factor and noise tradeoffs.

Path 4: Cloud-rented bare-metal GPU

I did this for two weeks before buying the DGX. The case for: zero capital expenditure. Burst-rentable for one-off heavy jobs. Lets you test what hardware would actually make sense before spending real money.

The case against: per-hour cost adds up fast for daily-driver use. Latency to the cloud GPU is real (30 to 100 ms typical). The whole point of “self-hosted” is contradicted if your weights live on someone else’s machine.

→ Pick if you have not yet decided which of the above three is right for you and you want a few weeks of real testing before buying. RunPod and Lambda Labs offer hourly bare-metal A100s and H100s at reasonable prices. In my experience this short-circuits into one of the other three within a month.

Path 5: Gaming laptop (Lenovo Legion Pro 7 Gen 10, RTX 5080 Mobile)

I did not pick this for myself, but I built it for a friend as a portable companion box, and it earns its place as a fifth path. The case for: a current Blackwell-class GPU (RTX 5080 Mobile, 16 GB VRAM) inside a machine that already exists in a lot of homes. One device, no rack, no VPS, no desk-side appliance to explain to a partner. It runs a local 8B-to-14B model under Ollama comfortably and gives someone their own private second brain without a single cloud token, which is the lowest-friction way to put a second person on a sovereign stack.

The case against: a laptop GPU is not a Spark. 16 GB of VRAM does not hold a 119B MoE, so you are in 8B-to-14B territory, not 100B-class. Thermals and power management throttle sustained load, and Blackwell-on-Linux laptops carry their own driver and firmware lottery: encrypted-boot keyboard quirks, a smart-amp audio chip with no Linux driver, the /data-on-LVM trap. All of it is in the build log below.

→ Pick if you want a portable, single-device sovereign box for someone starting smaller, or as a companion to a heavier machine. The full build, including how to set it up for someone else without leaking your own identity, is the 24-hour Lenovo Legion setup log, with the friend-setup playbook and the /data-convention trap.

Inference engine: the choice that is hardest to migrate from

Once the hardware is chosen, the inference engine choice locks in the rest of your stack for at least 6 to 12 months. Migrating between engines is not a quick swap. The model snapshots, the launch flags, the tokenizer integration, and the agent-side OpenAI-compatibility layer all subtly differ. Pick deliberately.

vLLM

What I run as primary today. The case for: most mature batching library. Best community support across model architectures. DFlash speculative decoding integration is now competitive with SGLang’s EAGLE for text-only workloads. The current production quant is Qwen 3.6 AutoRound int4-mixed at 69.2 tok/s single-stream (canonical ruler, 2026-06-11); the earlier PrismaQuant 4.75-bit build read around 71 tok/s on a DFlash k=3 non-streaming harness. The live figure is on /stack/ and the receipts are in the benchmark write-up.

The case against: memory accounting on bleeding-edge silicon is less stable than llama.cpp’s. The vllm-omni branch handles multi-modal but has its own quirks (see the Voxtral Stage 1 OOM article for one example I hit). Stable releases lag GB10 support by weeks.

→ Read next: the Mistral / Qwen / GLM-5 comparison for the model-and-engine decision, and Voxtral Stage 1 OOM on GB10: Why --enforce-eager Is Not Enough for one concrete vLLM-omni gotcha.

SGLang

What I run as the safer-eagle fallback for the Mistral Small 4 NVFP4 vision-capable backup path. The case for: best speculative-decoding integration historically (EAGLE, EAGLE-2, MTP). Mature batching. Good ARM64 nightly support. Active development with frequent fixes for new silicon. With Mistral Small 4 NVFP4 on safer-eagle: 36.5 tok/s decode, verified 2026-05-22.

The case against: nightly-build dependency on bleeding-edge hardware. Some attention backends (flashinfer) do not work on every architecture and you have to know which to pick (--attention-backend triton for GB10). EAGLE accept-rate drops on plain English prose (see EAGLE Speculative Decoding: When It Helps, When It Does Not for the content-dependent throughput tradeoff).

→ Read next: SGLang on DGX Spark: 35-41 tok/s with EAGLE Speculative Decoding for the production config that works for me.

llama.cpp

The case for: runs everywhere, including CPU-only setups. Tiny memory footprint. Excellent quantization options. The right default for “I want to run a small model on a Mac or Linux laptop without a GPU”.

The case against: not optimized for the multi-user batched inference an MCP-serving agent stack needs. Performance-per-watt is good. Performance-per-second on a single-user-streaming workload trails SGLang and vLLM by a meaningful margin on the same hardware.

→ Pick if you are on Mac or Linux laptop hardware, or your use case is occasional inference rather than daily-driver agent work.

Minimum-viable deploy: what actually needs to be running

Once hardware and inference engine are chosen, the minimum-viable agent-ready deployment has five components. Each can fail independently. Each needs its own startup-restart-monitor story. I have hit all five failure modes in the first six months.

1. Inference server (always-on)

SGLang, vLLM, or llama.cpp serving an OpenAI-compatible API on localhost:30000 (or wherever you put it). Wrapped in systemd or Docker so it restarts on host reboot. Load-test it once with a curl POST to confirm it speaks the OpenAI chat-completions format that everything else expects.

2. Agent client (one or more)

The agents that actually call the inference server. The honest list of what works in 2026, with how I use each one:

Once you have a knowledge base worth referencing (technical blog, internal docs, runbooks), a Model Context Protocol server makes that knowledge agent-callable. The pattern: an agent asks search_blog("flashinfer OOM on GB10"), gets back ranked excerpts plus operational fixes, instead of hallucinating a generic answer.

I shipped one for this blog. I also shipped an article saying it is mostly redundant at the current corpus size and explaining when it stops being redundant: The Sovereign AI Blog MCP Is Mostly Redundant Today, And That Will Change. Read that before you build your own.

4. Search and web-fetch layer

Most agents need web search at some point. Options:

5. Public edge (if you want anyone to reach the MCP)

If your MCP server should be reachable from cloud agents (Claude Code, Smithery, Glama gateways), it needs a public TLS endpoint. The cleanest pattern is a small no-KYC privacy VPS with Caddy reverse-proxy and Let’s Encrypt, with the MCP server reachable over Streamable HTTP. The DGX Spark stays at home and serves inference. The public VPS terminates TLS and proxies the MCP-tool calls. See Floki-VPS Setup for Sovereign AI Workloads for the Caddy config that does this.

For external agent discovery, list the server in the official MCP Registry. My entry is org.sovgrid/self-hosted-ai, registered with DNS-based ed25519 authentication so the listing is owned by the same domain that serves the endpoint. Smithery and Glama also index it, but the registry is the canonical source.

What hurts the most after you start

Honest list, in approximate order of how often each one bit me in the first three months. None of these are in the install guides because none of them are install problems.

1. Toolchain gaps on bleeding-edge silicon

If you bought DGX Spark or any other recent NVIDIA architecture, the upstream tooling assumes you have older silicon. PyTorch official wheels do not yet support SM 12.1. Nightly builds do but are unstable. Flashinfer attention backend has no SM 12.1 kernels and OOMs silently. My workaround: use --attention-backend triton for SGLang, accept that some throughput is left on the table until the official toolchain catches up. See SGLang on DGX Spark: 35-41 tok/s with EAGLE Speculative Decoding for the working configuration.

2. Sequential-only GPU services

SGLang, Voxtral (TTS), ComfyUI (image gen), and any other GPU-bound service cannot share the unified memory pool meaningfully. They have to take turns. The orchestration cost (which one is running, who restarts whom, how does the dashboard show current state) is non-trivial. Plan for this from day one rather than discovering it the first time you try to generate a hero image while inference is busy. I did not plan for it. I now run a dashboard that does the dance for me.

3. The first OOM

My first OOM was confusing because the failure surface and the root cause were in different processes. The CUDA OOM error appeared in a worker subprocess, but the flag that should have prevented it was passed to the parent process and never inherited. See Voxtral Stage 1 OOM on GB10: Why --enforce-eager Is Not Enough for the canonical example. The general lesson: when an OOM appears on hardware that should have plenty of memory, check whether the parent-process flags actually propagated to the child.

4. Mistral alternating-roles BadRequestError

If you point any agent framework at SGLang serving Mistral, you will hit a BadRequestError saying something like “conversation roles must alternate”. This is the canonical Mistral-on-strict-OpenAI-spec failure. Three different agents have three different fixes for the same root cause:

Same root cause, three different framework-side fixes. Knowing this in advance saves hours.

5. Disk fills up

Model weights, container images, build caches, and downloaded artifacts add up fast. Set up a daily disk-check probe before you actually fill the disk, not after. I learned this one the hard way after /data hit 95 percent on a Sunday morning. See Three Silent Failures That Would Have Killed My Self-Hosted AI Stack for the daily-check pattern that catches it now.

6. Backup discipline

Self-hosted means you are the backup operator. The minimum-viable backup is the secrets directory (Nostr keys, wallet seeds, SSH keys) on encrypted offline storage, plus the source-controlled stuff in Gitea. Anything less is asking for the bad day you have not had yet. See Backup System Rebuilt from Scratch: The Night I Found Out Six Months of Backups Were Fake for the discipline that earns its keep, with a name that tells you why I rewrote it.

The privacy floor: why no-KYC matters operationally

The Sovereign AI label gets used as marketing language a lot. On my stack it means three concrete things, each chosen for an operational reason rather than a philosophical one.

No-KYC for monetary infrastructure. The Lightning wallet (Alby ), the hardware wallet for cold storage (BitBox02 ), and the VPS provider (FlokiNET ) all do not require KYC. The reason is not paranoia. It is that KYC creates ongoing tax and regulatory obligations on the provider side that change over time, and the cost of migrating off a KYC’d provider after a policy change is high. No-KYC providers stay simple.

No-cloud for inference. The model runs on hardware I own. No upstream provider can deprecate the model, change its outputs, or filter what I can ask it. If the provider goes out of business, my stack still works. I have lived through three “the API you depend on is being deprecated next quarter” emails in my career. Never again, at least for this layer.

No-tracking for readers. The blog you are reading right now ships zero JavaScript pixels. No Google Analytics, no Cloudflare Insights, no Discourse, no Disqus. The only signal source is Caddy access logs aggregated nightly. Readers get pages. The blog gets aggregate counts. Neither side has to consent to being tracked because there is nothing to consent to.

These three together make up my operational privacy floor. They do not make this stack perfect. They make it credibly different from the cloud-AI default in ways that compound over years rather than degrade over years.

You are at the entry point of an engineering log. Here is the order I would pick up the rest of the stack if I were starting over.

The two anchor hubs to read alongside this one:

If you are still deciding hardware: read Should You Buy a DGX Spark in 2026: a Decision Tree and the DGX Spark vs M3 Ultra Mac Studio comparison. For budget-tier breakdowns: the four What I’d Buy in 2026 tiers (2k / 4k / 8k / 15k). Buying the wrong hardware is the most expensive mistake on this path.

If you are still deciding cloud-vs-local: read Cloud vs Local AI: Where Each Actually Wins in 2026 for the 13-task capability matrix, and Self-Hosted AI vs Cloud APIs: Real Total Cost for the dollar lens.

If you want to see how the pipeline behind sovgrid actually works: How This Blog Actually Gets Built covers the two-layer pipeline, the AGENTS.md ritual, the quality gates, and the stylometric layer. Five thousand words; the receipts are in the commit log.

Once hardware is chosen

Agent client

Public edge

Operational discipline

Strategy zoom-out (read after the operational stuff)

What I actually use

The hardware is settled: an NVIDIA DGX Spark (GB10, 128 GB unified), paid for in full, would buy again. The software half changes monthly, so the live inventory lives in one place instead of being duplicated here to go stale. The current model, inference engine, throughput, agent clients, and edge config are on /stack/, kept current as the canonical answer to “what is running right now”. This article stays deliberately about the decisions, which age slowly, not the version numbers, which do not.

  • Agent clients: Claude Code (cloud) for architecture and polish. opencode against Qwen 3.6 as the local primary. OpenClaw for the Mistral-side persona work with the strict-alternation patch.
  • Public edge: FlokiNET no-KYC VPS, Caddy with Let’s Encrypt direct (Cloudflared retired 2026-05-24), sovereign-mcp on mcp.sovgrid.org/self-hosted-ai.
  • Lightning wallet: Alby for V4V tipping. BitBox02 for cold storage. All no-KYC.
  • Daily-check probe: floki-healthcheck.sh runs daily from Spark via SSH: 12 checks, single Matrix push, JSON sidecar at /api/floki-health.json. The journal stays quiet on a normal day, which is when I know everything is fine.

This stack runs daily. It is not the only viable shape. It is one shape that works, documented honestly enough that the next person can decide whether to copy it, adapt it, or take the lessons and pick something different.

When you have made it through the reading path above, you will know which of those three you want to do.

Stack

Self-Hosted AI Decision Stack

Hardware, inference, agent client, what hurts after you start

6
Privacy floor no-KYC, no-cloud, no-tracking
5
Discovery MCP server (Streamable HTTP) over Caddy
4
Agent client Claude Code / OpenClaw / Vibe / OpenHands
3
Model Mistral Small 4 119B MoE (NVFP4) by default
2
Inference SGLang / vLLM / llama.cpp
1
Hardware DGX Spark / Mac M3 / used 3090s / cloud-rented

Was this worth it? Zap the article.

Value for value, no signup. Sats go straight to the writer.