Should You Buy a DGX Spark in 2026? The Honest Decision Tree

May 23, 2026 28 min read

The short answer is no, with three exceptions.

Update (2026-06-19). Any PrismaQuant figure below is the engineering-log record. The production primary moved to Qwen 3.6 AutoRound int4-mixed on 2026-06-11 (69.2 tok/s, 12.7 percent better on the coding gate; PrismaQuant retired). The live model and throughput are on /stack/; the switch is measured in AutoRound int4 vs PrismaQuant.

You should buy a DGX Spark in 2026 if, and only if, one of these three statements describes you. You want a Blackwell-class single-box workstation that runs 100B-parameter mixture-of-experts models at room temperature in your office. You are building a small product or consulting practice where your customers will pay you for the fact that no inference call leaves your premises. You are deliberately investing in the NVIDIA software stack for career or research reasons, and you have priced the lock-in into the decision.

If none of those three statements describes you, the rest of this article is going to argue you out of the purchase. That is not a sales tactic. The Spark is not for most people, and the people for whom it is the right machine already know who they are. The rest of you have cheaper, faster, less opinionated options, and the rest of this piece is the map of them.

I have been running a DGX Spark on my desk since early April 2026. I have crashed it, recovered it, hijacked its page cache, sworn at its quirks, and shipped a one-person business off it. I have measured throughput on two production models at three quantizations, written the systemd units and the failure-recovery runbooks, and learned the difference between what NVIDIA’s marketing says it can do and what it actually does at 02:00 when a workload corner case meets a stale kernel page cache. The Spark is real. The Spark is also not what most prospective buyers think it is. This is the article I wish someone had handed me before I bought mine.

Quick Take

The honest no-buy rate is about a third. Of the buyers who write to me with a stated intent to purchase, roughly one in three changes their mind after a Stack Audit. The Spark is the right machine for fewer people than the marketing implies.

MoE language models are the win condition. Qwen 3.6 PrismaQuant 4.75bit runs at around 71 tok/s sustained interactive under DFlash speculative decoding, verified on my own pipeline (the original Spark Arena measurement of 57 to 62 tok/s was pre-DFlash; the number moved with the configuration). The Spark wins on architecture-fit, not raw FLOPS.

Dense >70B and diffusion are the lose conditions. A used dual RTX 3090 build at half the price beats the Spark on dense Llama-class workloads and on Stable Diffusion / Qwen-Image-class generation. Buy what your real workload needs.

Cloud rental is mathematically correct under ~1,500 GPU-hours per year. At €0.40 to €0.80 per hour for comparable cloud GPUs, the cloud is cheaper than the Spark for intermittent users by an order of magnitude. Sovereignty has to be a real business requirement, not a preference.

Two operational quirks decide first-month usability. VLLM_FLASHINFER_MOE_BACKEND=latency is mandatory or the desktop freezes (receipt). echo 3 > /proc/sys/vm/drop_caches before every model swap or the next launch OOMs at 95 GB (receipt). Neither is in the documentation. Build the runbook before you need it.

NVIDIA software stack is part of the purchase. You are buying Ubuntu plus CUDA-Blackwell plus vLLM plus a relationship with three-month-old driver edges. If you wanted macOS ergonomics, you wanted a Mac Studio.

Profiles for whom the Spark is wrong, in six specific clauses

This is the strongest moment in the article, so I am putting it third. The Spark is the wrong machine for more readers than the readers it is right for. Be honest with yourself about which list you are on.

You should not buy a DGX Spark if your workload is dense large language models above about 70B parameters. The Spark wins on mixture-of-experts architectures because the active parameter count per token is a fraction of the total parameter count. A 119B-parameter MoE model with around 17B active parameters per token sits inside the Spark’s unified memory and runs at usable interactive throughput. A 70B-parameter dense model does not have the same shape. It activates every parameter on every token, the memory bandwidth becomes the bottleneck, and you discover that the Spark’s headline parameter capacity is not the same thing as throughput capacity. The Spark’s published memory bandwidth on the GB10 is in the high-200s GB/s range, well below what an HBM3-class data-center GPU sustains. For dense Llama-class or dense Mistral Large class, you want a different machine.

You should not buy a DGX Spark if you need diffusion-model throughput. Image generation, video generation, and the larger Stable Diffusion variants are heavier on raw FLOPS and less helped by unified memory than language models are. A pair of used RTX 3090s with NVLink and 48 GB of total VRAM will outperform the Spark on these workloads at half the price and one quarter the integration pain. The Spark can run diffusion. It is not the best home for it.

You should not buy a DGX Spark if your workload is intermittent. If you genuinely use AI for two hours a week, you are buying a workstation that will sit at idle for 166 hours a week. Idle is not free. The unified memory is allocated, the cooling is running, the system is consuming power. The break-even against cloud rental at RunPod or Vast.ai prices of €0.40 to €0.80 an hour for comparable cloud GPUs is several thousand hours of utilization. If you do not have those hours, the cloud is cheaper, and you should accept that the cloud has correctly priced your real demand.
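
The break-even claim above is plain division, and a minimal sketch makes it easy to rerun with your own numbers. The €4,800 purchase price and the €0.40 to €0.80 hourly range come from this article; the function names are illustrative, and the calculation deliberately ignores electricity, resale value, and operator time, so treat it as an order-of-magnitude check rather than an accounting model.

```python
def break_even_hours(hardware_cost_eur: float, cloud_rate_eur_per_hour: float) -> float:
    """Hours of cloud rental that cost as much as buying the hardware outright."""
    if cloud_rate_eur_per_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return hardware_cost_eur / cloud_rate_eur_per_hour


def years_to_break_even(break_even: float, hours_per_week: float) -> float:
    """Years needed to reach break-even at a steady weekly usage."""
    return break_even / (hours_per_week * 52)


SPARK_COST_EUR = 4800.0  # assumption: the purchase price quoted in this article

for rate in (0.40, 0.80):
    hours = break_even_hours(SPARK_COST_EUR, rate)
    years = years_to_break_even(hours, hours_per_week=2)
    print(f"EUR {rate:.2f}/h: {hours:,.0f} h to break even, {years:.0f} years at 2 h/week")
```

At two hours a week, either end of the price range puts break-even decades out, which is the arithmetic behind letting the cloud price your real demand.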

You should not buy a DGX Spark if you are not willing to manage a Linux server. This is the constraint that catches most surprised buyers. The Spark is a server. It runs Ubuntu, not macOS. It does not have a graphical desktop you should rely on for daily work. The proper way to use it is headless: you SSH into it from your laptop, you run vLLM or SGLang as a systemd service, you monitor it through dashboards and logs, and you treat it like a small private cloud that lives under your desk. If your mental model is “I will plug in a keyboard and use it like an iMac with extra horsepower,” you will be unhappy. The Spark is not unhappy with you. You are using it wrong.

You should not buy a DGX Spark if you cannot tolerate the NVIDIA software stack. The Spark ships with the CUDA-Blackwell ecosystem at a relatively early point in its lifecycle. You will find driver edges. You will find vLLM and SGLang releases where one specific environment variable determines whether the machine runs at full throughput or freezes the desktop. (See vLLM MoE throughput on sm121 Desktop freeze, which cost me a week of debugging before the right backend flag was identified, and the NVIDIA developer-forum thread on vLLM 0.17 MXFP4 patches where the FP4 quantization-error class of bugs is documented in the open.) You will find documentation that is plausible but stale. You will find that NVIDIA’s own playbooks help on some questions and hurt on others. If you want a hardware stack where you can ignore the software layer for the first eighteen months, you want a Mac Studio.

You should not buy a DGX Spark if your security model requires a specific certification that NVIDIA’s Blackwell platform does not yet hold. Some defense, healthcare, and regulated-finance environments have certification requirements that lag a platform’s release by twelve to twenty-four months. The Spark’s security posture is excellent in principle, but if your contract requires a specific FIPS validation level or a HIPAA-attested platform from a long-list vendor, the Spark may not be the right purchase yet even if the technology is perfectly capable. Check the paperwork before the box.

If any of those six paragraphs describes your situation, stop reading and consider one of the alternatives below. If none of them does, the rest of the article is the positive case.

Profiles for whom the Spark is right, in four shapes

I keep finding four reader profiles for whom the Spark is the right purchase. The reasons differ enough that they deserve separate treatment.

Profile 1: The MoE Operator

You are running, or planning to run, mixture-of-experts language models in the 80B to 130B total-parameter range with 10B to 25B active parameters per token. Qwen 3.6 in its various quantizations, the larger Mistral mixtures, the open-weights MoE models that have emerged over the last year. You have read enough benchmarks to know that quantization is not free, that the right quantization for your workload depends on whether you need vision, that throughput in tokens per second is downstream of a stack of decisions about backends and flags. (For the brutal version of the quant-versus-capability trade-off, see Mistral vs Qwen 3.6: The Zero That Was a Broken Ruler.)

You want this on your premises rather than in someone else’s cloud because your data is sensitive, or your customers are regulated, or you have been burned once too often by a cloud provider’s terms-of-service change. The Spark fits this profile better than any other workstation in the price tier. The unified memory architecture is the right shape for MoE, and the price-per-active-parameter is competitive in a way that even a dual-3090 build cannot match.

This is my profile. The Spark is correct for me. It is the smallest hardware envelope that runs a 119B-parameter MoE at interactive throughput without any inference call leaving my office.

Profile 2: The Sovereign-AI Consultant

You are building a consulting practice where the core deliverable is “your customer’s inference does not leave their premises.” Your customers might be law firms, medical practices, journalists, defense contractors, or small manufacturers. They are paying you partly for the model and partly for the fact that the model never phones home. The DGX Spark is the demonstration machine. It is what you set up on the customer’s premises during the engagement, what you show them running their workload in their network with no outbound traffic to OpenAI or Anthropic, and what justifies the consulting fee.

For this profile the Spark is a depreciating asset that pays for itself in a small number of consulting engagements. The math is straightforward. If a single sovereign-AI engagement is priced at €8,000 to €15,000, and the Spark is €4,800 post-February-2026 MSRP, the asset pays for itself before the second customer is invoiced. (For the pricing reasoning, the companion piece How I Priced Sovereign AI Consulting, unpublished until the consulting practice opens, walks the rate logic. As of May 2026 the sovgrid consulting practice is at the “scope-call SKU validation” phase, not the “five enterprise engagements shipped” phase, and the writing reflects that.) The depreciation is real but the asset pays back fast.

The risk with this profile is that you buy the Spark for a consulting practice you have not yet sold. If you do not have at least one credible lead in the pipeline when you buy the machine, the Spark becomes the sunk cost that pushes you to discount your first engagement to recover the investment. That dynamic is bad for pricing discipline.

Profile 3: The Career Investor

You are deliberately investing in the NVIDIA software ecosystem because that is where the jobs, the open-weights model releases, and the upstream developer mindshare are concentrated in 2026. You want hands-on time with CUDA-Blackwell, with vLLM, with the FlashInfer kernels, with whatever Triton compilation stack is dominant by the end of the year. You will spend three years on this platform whether you buy a Spark or not. You might as well own the asset.

This profile is honest about what is being purchased. You are not buying a workstation. You are buying a three-year apprenticeship in a specific software stack. The hardware is the apprenticeship’s substrate. If the apprenticeship is the goal, the Spark is the cheapest substrate that puts you on the production tooling rather than the consumer-card workarounds. The Mac Studio cannot do this for you. It is the wrong stack.

The risk with this profile is that the platform you are investing in changes shape. NVIDIA’s Blackwell generation is real and the software is improving every month, but the field is volatile enough that a three-year investment is a bet. Make the bet with eyes open. (For where the model layer is shifting, Strategy: Next Model Choices on DGX Spark is the long view from inside the operation.)

Profile 4: The Heavy Hobbyist with a Long Horizon

You enjoy this. You have read every quantization paper that crossed your feed for the last eighteen months. You have a job that pays well enough that €4,000 is recoverable in a few months and the equipment is not a financial stretch. You have already tried two other paths, and you have the self-knowledge to admit that “play with Llama on the weekend” is not the workload; the workload is “be the person on the forum who has actually run the model that everyone else is theorizing about.”

If this is you, you have probably already bought the Spark and you are reading this article for confirmation. Yes. It was the right call. The hobbyist case is real and not embarrassing. The only honest warning is that the hobby is a maintenance hobby, not just a usage hobby. Be ready to like the maintenance, because it is the bulk of the time.

The five real alternatives, side by side

This is the table the listicles do not give you. Columns are the realistic 2026 options at the €2k to €5k buyer tier. Rows are the dimensions that actually decide the purchase. “Best” cells are highlighted; “fatal flaw” cells are flagged.

Dimension	DGX Spark	Used dual RTX 3090 + NVLink	Apple M3 Ultra Mac Studio	Strix Halo mini-PC	Cloud rental (RunPod / Vast.ai)
Street price (2026, all-in)	€4,800-5,200 (post-Feb-2026 hike to $4,699 MSRP)	€1,900-2,400 (build cost)	€3,800 (M3 Ultra 96 GB) to €9,500+ (M3 Ultra 512 GB)	€2,200-3,200	€0.40-0.80 / GPU-hour
Total memory for model	128 GB unified	48 GB VRAM (NVLinked)	up to 512 GB unified (M3 Ultra top SKU)	up to 128 GB unified	per-instance, elastic
Memory bandwidth	~273 GB/s (GB10)	~936 GB/s per card	~800 GB/s	~256 GB/s	varies (often >1 TB/s on H100)
MoE 100B+ models	✅ designed for this	❌ does not fit	⚠️ fits but slower	⚠️ fits, sw maturity lags	✅ per-instance
Dense 70B+ models	⚠️ bandwidth-bound	⚠️ does not fit at full precision	⚠️ fits, lower throughput	⚠️ similar to Spark	✅
Diffusion / image gen	⚠️ OK, not best in class	✅ best per €	⚠️ slower, fewer kernels	⚠️ ROCm immature	✅
Software stack maturity	NVIDIA CUDA-Blackwell, new	NVIDIA CUDA-Ampere, mature	Apple MLX / llama.cpp	AMD ROCm, immature	matches instance
Driver-edge frequency	high (early platform)	low (mature)	low (Apple controlled)	high	n/a
Noise under load	moderate	loud	silent	quiet	n/a
Power draw under load	moderate (vendor-published)	high (~700 W system)	low (<100 W)	low-moderate	n/a
macOS / GUI ergonomics	server, headless	server, headless	first-class desktop	desktop possible	n/a
Sovereignty (on-premises)	✅	✅	✅	✅	❌ fatal flaw for sovereign use
Resale value (24 months)	unknown (new platform)	medium (mature card)	high (Apple resale)	low	n/a
Break-even vs cloud	~3,000 GPU-hours	~1,800 GPU-hours	~3,500 GPU-hours	~2,400 GPU-hours	break-even is the cloud
Best for	Profile 1, 2, 3	dense LLM + diffusion + lab learning	macOS ergonomics + moderate LLM	AMD-aligned operators	intermittent + product validation
Avoid if	dense >70B, diffusion-first, GUI-first	need >48 GB VRAM, MoE-first, quiet office	NVIDIA-ecosystem required	software-edge-intolerant	sovereignty is a customer requirement

The table compresses the article. If you only read the table, you have most of the answer. The article exists because the compression loses the nuance, and the nuance is where the wrong-purchase decisions hide.

Two cells deserve specific honesty. The “memory bandwidth” row puts the Spark at ~273 GB/s, well below a dual-3090 setup at ~936 GB/s per card. This is the architectural truth that explains why the Spark wins on MoE and loses on dense: MoE moves only the active experts’ parameters per token, so bandwidth-per-token is what matters, and the Spark’s unified-memory layout makes that movement cheap. Dense models move every parameter per token, and at that point a discrete GPU’s much higher memory bandwidth (GDDR6X on the 3090, HBM on data-center cards) wins outright. The Spark is not a flops machine. It is a memory-shape machine. Understanding that one row is most of the purchase decision.
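
One way to see the bandwidth row in numbers is a back-of-envelope ceiling: if decoding is limited by reading the active weights once per token, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below assumes exactly that and nothing more. It ignores KV-cache traffic, compute limits, and batching, and speculative decoding can exceed it because several tokens are verified per pass over the weights. The 4-bit weight width is an illustrative assumption, not one of the quantizations measured in this article, and the function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_billion: float,
                         bits_per_weight: float = 4.0) -> float:
    """Bandwidth-bound ceiling on decode speed, in tokens per second."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


SPARK_BANDWIDTH_GB_S = 273  # from the comparison table

moe = decode_ceiling_tok_s(SPARK_BANDWIDTH_GB_S, active_params_billion=17)
dense = decode_ceiling_tok_s(SPARK_BANDWIDTH_GB_S, active_params_billion=70)
print(f"MoE, 17B active: ~{moe:.0f} tok/s ceiling")
print(f"Dense 70B:       ~{dense:.0f} tok/s ceiling")
```

Under these assumptions the same box has roughly a fourfold higher ceiling on the MoE model than on the dense one, purely because fewer bytes cross the memory bus per token.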

The “best per €” cell for the dual-3090 build is real and should be respected. If your single largest workload is image generation, fine-tuning a small model, or running a dense 30-65B model at heavy quant, you should buy two used 3090s, not a Spark. The right answer is the answer that matches the workload, not the answer that matches the headline.

The flowchart

                          Are you running ≥80B-parameter MoE language
                          models or plan to within 18 months?
                                          │
                              ┌───────────┴───────────┐
                              │                       │
                             YES                      NO
                              │                       │
              Are your customers paying            Do you need the unified
              you for on-premises inference?      memory architecture for
                              │                   a specific dense workload
              ┌───────────────┴──────────┐        that doesn't fit in 48 GB?
              │                          │                  │
             YES                         NO          ┌──────┴──────┐
              │                          │           │             │
        Buy a Spark.              Are you investing  YES          NO
        Profile 2.                in the NVIDIA      │             │
                                  software stack    Spark may       │
                                  for career or     fit, but        │
                                  research reasons? consider M3     │
                                            │       Ultra first.    │
                                  ┌─────────┴────────┐               │
                                  │                  │       Is the workload
                                 YES                NO       ≥1,500 GPU hours/year?
                                  │                  │               │
                            Buy a Spark.       Is your hobby     ┌───┴───┐
                            Profile 3.         budget €4k+       │       │
                                               and your          YES     NO
                                               horizon ≥3 years? │       │
                                                       │      Used     Rent from
                                              ┌────────┴──┐  dual      RunPod or
                                              │           │  3090      Vast.ai.
                                             YES         NO  build.    Cloud is
                                              │           │   For      cheaper at
                                        Spark works.   Don't    diffusion or  this duty
                                        Profile 4.    buy a    dense.        cycle.
                                                      Spark.
                                                      Rent or
                                                      use a Mac.
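
The same flowchart can be read as a short function, which is sometimes easier to follow than box-drawing characters. This is a sketch that mirrors the branches above one for one; the `Buyer` field names are hypothetical labels for the questions in the chart, and the returned strings are the chart's own leaf text.

```python
from dataclasses import dataclass


@dataclass
class Buyer:
    runs_large_moe: bool            # >=80B-parameter MoE now or within 18 months
    paid_for_on_prem: bool          # customers pay for on-premises inference
    investing_in_nvidia: bool       # NVIDIA stack for career or research reasons
    hobby_budget_and_horizon: bool  # hobby budget of 4k EUR or more, horizon >= 3 years
    dense_over_48gb: bool           # specific dense workload that does not fit in 48 GB
    gpu_hours_per_year: float


def recommend(b: Buyer) -> str:
    """Walk the article's flowchart from the top question down to a leaf."""
    if b.runs_large_moe:
        if b.paid_for_on_prem:
            return "Buy a Spark (Profile 2)"
        if b.investing_in_nvidia:
            return "Buy a Spark (Profile 3)"
        if b.hobby_budget_and_horizon:
            return "Spark works (Profile 4)"
        return "Don't buy a Spark; rent or use a Mac"
    if b.dense_over_48gb:
        return "Spark may fit, but consider M3 Ultra first"
    if b.gpu_hours_per_year >= 1500:
        return "Used dual 3090 build"
    return "Rent from RunPod or Vast.ai"


print(recommend(Buyer(False, False, False, False, False, gpu_hours_per_year=100)))
```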

The self-correction I owe the previous draft

When I first drafted this article a week ago, I led with a binary framing: “the Spark is wrong for most people, right for me.” Reading the draft back against the top-performing strategy article on the model swap, I noticed the framing was lazy. The Spark is wrong for categories, not for people. The same person who is wrong for the Spark on Monday morning, when their workload is fine-tuning a 13B dense model at home, can become right for the Spark on Friday afternoon, when their workload pivots to running a 100B MoE for a regulated customer. Workloads change. The decision tree above is the categorical answer. The longitudinal answer is that the right machine in 2026 may be the wrong machine in 2027 and right again in 2028. Buying hardware is partly a bet on what your work will look like over a depreciation window. Make that bet explicit.

The corrected framing matters because it changes who the Stack Audit is for. It is not for “people who might be wrong about the Spark.” It is for “people whose workload is in flux and who want a second pair of eyes on which category they are actually in this quarter.”

The operational reality nobody mentions

Four things about the Spark are true and are not in the marketing.

The page cache is the first thing that bites you. When a large model crashes or you swap between models, the previous model’s weights remain in the kernel page cache. The next launch can OOM at 95 GB of unified memory even though the new model needs only about 70 GB, because the kernel is still holding 30 GB of stale weights it has not been told to release. The fix is a single line, echo 3 > /proc/sys/vm/drop_caches, run as root before every model swap. (See Fixes: SGLang Restart OOM Fix for the postmortem.) NVIDIA’s documentation does not mention this. You learn it the hard way and then you write the runbook so the next operator does not have to.
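
A minimal Python sketch of that pre-swap step, for operators who drive swaps from a script rather than a shell. It assumes a Linux host and root privileges; the function names are illustrative and not part of any tool named in this article. Calling `os.sync()` first is a common precaution, since writing to drop_caches only discards clean pages.

```python
import os
from pathlib import Path

DROP_CACHES = Path("/proc/sys/vm/drop_caches")


def drop_page_cache(target: Path = DROP_CACHES) -> None:
    """Flush dirty pages, then ask the kernel to drop page cache, dentries and inodes."""
    os.sync()                 # write dirty pages out so more of the cache becomes droppable
    target.write_text("3\n")  # same effect as: echo 3 > /proc/sys/vm/drop_caches


def mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo text (the kernel reports it in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 1024**2
    raise ValueError("MemAvailable not found")
```

Reading `/proc/meminfo` through `mem_available_gib` before and after `drop_page_cache()` gives a quick check that the stale weights were actually released before the next model starts loading.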

One environment variable can mean the difference between a frozen desktop and around 71 tokens per second. Inference backends on the Spark go through several kernel selection paths. The wrong path can freeze the entire window manager while inference appears to be running. On Qwen 3.6 the relevant flag is VLLM_FLASHINFER_MOE_BACKEND=latency, set before the vLLM service starts. With this flag, the machine sustains around 71 tokens per second on the production quantization under DFlash speculative decoding. Without it, the same workload can take down the desktop session, requiring an SSH reboot from another machine. (See Fixes: vLLM MoE Throughput sm121 Desktop Freeze for the full debug log.) The default flag value is wrong for the most common workload. This is the kind of detail that does not appear in any product page.
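
Because a missing flag fails silently until the desktop locks up, it can help to make the launcher refuse to start without it. The sketch below is one hedged way to do that: the variable name and value are the ones given in this article, while `launch_env` and `missing_flags` are hypothetical helper names.

```python
import os
from typing import Dict, List, Mapping, Optional

# Name and value as stated in the article; extend the dict if your stack needs more.
REQUIRED: Dict[str, str] = {"VLLM_FLASHINFER_MOE_BACKEND": "latency"}


def launch_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the required flags forced on."""
    env = dict(os.environ if base is None else base)
    env.update(REQUIRED)
    return env


def missing_flags(env: Mapping[str, str]) -> List[str]:
    """Names of required flags that are absent or set to a different value."""
    return [name for name, value in REQUIRED.items() if env.get(name) != value]
```

A service wrapper could call `missing_flags(os.environ)` at startup and exit non-zero if the list is not empty, or pass `launch_env()` as the `env` argument when spawning the inference process. With systemd, the same guarantee is usually expressed as an `Environment=` line in the unit file.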

The mutex pattern is the other piece of operational architecture you build in the first month. With Qwen 3.6 PrismaQuant at 22 GB on disk and Mistral Small 4 NVFP4 at 60 GB on disk, the unified memory budget can technically hold both. In practice, hot-loading both creates a memory cascade that pulls the desktop session down. The fix is /data/scripts/llm/switch.sh qwen|mistral|none|status, a Termux-friendly one-line mutex that handles the systemctl start/stop pair, the Watchtower disable-label, and the sanity check on which model is currently hot. The systemd unit vllm-qwen36.service exists but is deliberately not enabled at boot; mutual exclusion with Mistral is an operator job through switch.sh, not a systemd default.
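
For readers who want the shape of such a mutex without the original script, here is a hedged sketch of the core idea: stop every engine that is not the target, drop the page cache, then start the target. It is not the author's switch.sh. The unit name `vllm-qwen36.service` is from this article; `vllm-mistral.service` is a hypothetical name, and the Watchtower label handling and the status subcommand are left out.

```python
import subprocess
from typing import Dict, List

SERVICES: Dict[str, str] = {
    "qwen": "vllm-qwen36.service",      # unit name from the article
    "mistral": "vllm-mistral.service",  # hypothetical unit name
}
DROP_CACHES_CMD = ["sh", "-c", "sync && echo 3 > /proc/sys/vm/drop_caches"]


def switch_plan(target: str) -> List[List[str]]:
    """Commands that leave at most one engine running; 'none' stops everything."""
    if target != "none" and target not in SERVICES:
        raise ValueError(f"unknown target: {target}")
    plan = [["systemctl", "stop", unit]
            for name, unit in SERVICES.items() if name != target]
    plan.append(DROP_CACHES_CMD)  # release stale weights before the next load
    if target != "none":
        plan.append(["systemctl", "start", SERVICES[target]])
    return plan


def run_plan(plan: List[List[str]]) -> None:
    """Execute the plan in order and abort on the first failing command (needs root)."""
    for cmd in plan:
        subprocess.run(cmd, check=True)
```

Separating the plan from its execution keeps the ordering easy to inspect and test without touching a live service.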

The pattern of “default flag value is wrong” is not unique to the Spark, but it is more frequent on a freshly-released platform than on a mature one. The cross-reference in the NVIDIA developer forum thread on vLLM 0.17 MXFP4 patches captures the surrounding noise:

“gpt-oss-120b on TP=1 exhibits FP4 quantization errors affecting structured reasoning tokens.”

That single line is the kind of receipt that does not show up in vendor documentation but determines whether a model will serve real tool-calling traffic or quietly fail. Three months from launch, this is what early-platform operation feels like. It improves quickly. It also requires reading developer forums.

The recovery procedure for a real crash is thirty minutes if you have rehearsed it, several hours if you have not. I have written a thirty-minute recovery runbook because I have crashed the machine enough times to need one. The recovery is not difficult. It is just specific. systemd service order matters. The order in which you flush the page cache, restart the inference backend, verify the systemd state, and re-attach the dashboard matters. Without the runbook, every crash is a fresh debugging session and an evening lost. With the runbook, the machine is back in thirty minutes. Build the runbook before you need it.
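
Since the order of steps is the whole point, a runbook can be encoded as an ordered list that halts at the first failure. This is a generic sketch, not the author's runbook: the step labels follow the order named above, and each action is a placeholder to be replaced with a real command or check.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]  # (label, action returning True on success)


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure and return the labels that completed."""
    completed: List[str] = []
    for label, action in steps:
        if not action():
            print(f"FAILED at: {label}")
            break
        completed.append(label)
    return completed


# Order taken from the article; every lambda is a placeholder action.
example_steps: List[Step] = [
    ("flush page cache", lambda: True),
    ("restart inference backend", lambda: True),
    ("verify systemd state", lambda: True),
    ("re-attach dashboard", lambda: True),
]
```

Stopping at the first failed step matters here because later steps assume the earlier ones succeeded, for example restarting the backend before the cache has been flushed would reproduce the original OOM.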

These four details are not warnings against the Spark. They are the price of the Spark. The price is paid in operator competence, not in money. If you are not comfortable paying that price, the Spark is the wrong machine. If you are, the Spark is fine, and these details become the texture of normal operation rather than the obstacles to it.

Six weeks with a fresh Spark: month-by-month projection

If you have decided to buy and you want a realistic onboarding timeline, this is the projection from my own logbook. Treat the timeline as a planning aid, not a guarantee.

Week 1: install, baseline, first frustration. You unbox the machine, run the NVIDIA-provided OS image, pull a vLLM container, load your first model. Probably Qwen 3.6 PrismaQuant 4.75bit because it is the current top-throughput option on a single Spark. You measure ~30 to ~62 tok/s decode depending on which flags you set and whether DFlash is active, which is wide variance from the same hardware. You hit your first OOM during a model swap and learn about the page cache the hard way. You file your first GitHub issue or find one already open. By the end of week 1, you have a working LLM endpoint on a systemd service, you have a Tailscale mesh letting your laptop reach it, and you have a list of seven things that are not yet right.

Month 1: stable runbook, first real workload, first crash recovery. You write the runbook. You wire up an inference dashboard. You move one production workload to the local endpoint. You experience a real crash at hour 600 of uptime, you execute your runbook, you are back online in the thirty-minute target, and the runbook gets one paragraph added for the failure mode you had not anticipated. The Spark is now a working tool, not a project.

Month 3: throughput improves without you doing anything. vLLM releases a new version with a kernel optimization, you redeploy the service, throughput climbs five to fifteen percent. You add Mistral as a second model for the workloads where it still beats Qwen 3.6 (creative prose, vision-language, German). The Spark now hosts two models on disk with a switch.sh mutex enforcing memory exclusivity (only one engine hot at a time, because hot-loading both creates a unified-memory cascade). The Watchtower disable-label on both inference containers stops the 385-restart cycle that would otherwise hit you when an upstream image is pushed mid-session. The operational mode is genuinely steady-state. Customer engagements start using the local endpoint.

Year 1: depreciation accounting and the next-platform question. By month 12 the Spark has paid for itself by any reasonable measure if Profile 2 (Sovereign-AI Consultant) is your case. Throughput has improved by 30 to 50 percent through software alone, because that is the historical pattern on new NVIDIA platforms over the first twelve months. You start watching for the second-generation Spark or the announced 256 GB unified-memory variant. The depreciation accounting is straightforward: you have run the asset for a year, billed enough engagements to cover it, and your knowledge of CUDA-Blackwell has compounded.

If your projection looks unlike this timeline, the divergence is the data. Most divergences mean the Spark was the wrong fit for the workload, not that the timeline was wrong.

What the spec sheet does not tell you

Power draw under sustained inference load is real but not catastrophic. The machine sits in a normal office without special cooling and the room temperature rises by a few degrees during long inference sessions. I have not put a Kill-A-Watt on the power input and will not quote a wattage figure I have not measured. The published TDP is in the documentation. The lived experience is that the Spark is a higher-draw workstation than a Mac Studio (which draws under 100 W under typical load per Apple’s M3 Ultra Mac Studio tech specs) and a lower-draw workstation than a dual-3090 lab build (which typically pulls 600 to 800 W at the wall under inference load). If you live in a small apartment with a strict power budget, the Spark is not the friendliest choice. If you have a typical office circuit, it is fine.

Noise is moderate. The Spark has active cooling that ramps under load. It is louder than a Mac Studio and quieter than a Threadripper workstation under similar load. The noise is the kind that recedes into the background after a week. If you record audio in the same room, you will care. If you do not, you will not. The Mac Studio is the clear winner on this dimension: silent under any load I have tested. The dual-3090 build is the clear loser, especially with reference blower-style cards.

The chassis is small enough to live under a desk and large enough to be visible. The aesthetic, if you care, is good. NVIDIA has put real effort into this generation’s industrial design. The connectors and ports are sensible. The thing looks like a serious piece of equipment, and for an asset that is going to anchor a sovereign-AI consulting practice or a publishing operation, looking serious is not nothing.

The street price varies. The vendor-published price is one number, the actual price after taxes and shipping and the inevitable accessory purchases is another. Budget €4,800 to €5,200 all-in for the European purchase including a decent UPS, the right cables, and a small replacement SSD if you intend to push the storage. The price is real and it is in the range the marketing implies, but it is the headline number, not the all-in.

Where this fits in the larger sovgrid posture

This article is part of a longer argument about sovereign AI, which is itself part of a longer argument about what it means to operate a serious workload in 2026 without renting your business model from a hyperscaler. The Spark is a particular hardware bet inside that argument. The reasoning that made the Spark the right bet for me may not be the reasoning that makes it right for you, but the bet is one shape of a class of bets that more readers should be making.

For the broader voice and posture, The Quiet Pattern Among Sovereign Engineers sets the temperament. For the rest of the stack that runs on top of the Spark, the Self-Hosted AI Start Here guide is the canonical onboarding. For the honest math on operating the asset over twelve months, the companion Self-Hosted AI vs Cloud APIs: Real Total Cost is the unit-economics deep dive (includes the May 2026 Opus 4.7 tokenizer change that raised effective cloud cost up to 35 percent with no headline price move). For the broader honesty about what benchmarks do and do not tell you, Two Leaderboards Nobody Reads Together is the long version. For the model-stack decisions that run on top of the Spark, Spark Arena Rank 4 Made Me Add Qwen3.6 is the reasoning behind the current production setup. For the reference architecture that combines all the choices, the companion Sovereign AI Stack 2026 Reference Architecture is the hub.

If you have read this far and you are still uncertain, the uncertainty is information. The Spark is a high-conviction purchase. If you do not have the conviction, you should rent for six months, watch your own usage, and revisit the decision with data. The cloud cost will be small. The wrong purchase will be expensive.

If you have read this far and you are sure but you want a second pair of eyes on the specific configuration for your workload, that is the use case for a Stack Audit.

Book a Stack Audit before you buy

A Stack Audit is a paid two-hour engagement where I look at your specific workload, your existing hardware, your one-year roadmap, and the alternatives ranked against your actual constraints, and I tell you which purchase is correct. The audit cost is small compared to a wrong €4,500 hardware decision. Most audits end with one of three answers: “buy the Spark, here is the configuration,” “do not buy the Spark, here is the alternative that fits your workload,” or “rent for six months first, here is what you should measure during that window.”

The audit is not a sales pitch for the Spark. About a third of the audits end with “do not buy the Spark.” The honesty is the product. If your decision has settled into “Spark” without an audit, that is your call. If your decision is still oscillating between three or four hardware paths, the audit collapses the oscillation faster and cheaper than another month of forum reading.

To book a Stack Audit, reach me through any of the contact links in the footer of this page (Nostr DM is the fastest, the email link is HTML-entity-encoded so it survives spam scrapers, the GitHub profile takes issues too). Include the workload sketch in the first message: which buyer profile above matches your situation, your one-year roadmap, your binding constraint (throughput, sovereignty, budget, or ergonomics). Replies are within seventy-two hours during weeks I am not traveling. The honest answer might be “you do not need an audit, here is the one-paragraph answer for your case,” in which case you are out a message and not the fee.

The Spark is a real machine. It is the right machine for a small number of buyers and the wrong machine for a larger number of buyers. The decision is worth getting right, and getting it right is cheaper before the order ships than after the box is in your office.
