The Unified-Memory Inference Mental Model

May 19, 2026 8 min read

Update (2026-06-19). The “Qwen 3.6 PrismaQuant” references here predate the 2026-06-11 production switch to AutoRound int4-mixed (69.2 tok/s, 12.7 percent better on the coding gate, vision retained, PrismaQuant retired). The figures are kept as the engineering-log record; the live stack is on /stack/ and the switch is measured in AutoRound int4 vs PrismaQuant.

The unified-memory architecture trades raw memory bandwidth for a single addressable memory pool that both the CPU and the GPU can read without copying. The performance consequence is that mixture-of-experts language models with sparse expert activation win disproportionately on unified-memory hardware, while dense models that move every parameter through the memory bus lose disproportionately. The mental model that produces correct purchase decisions on a DGX Spark or an Apple M3 Ultra is “what is the per-token data movement?”, not “how many gigabytes does the model occupy?”

Quick Take

The trade: unified memory has lower peak bandwidth than discrete-GPU HBM, but eliminates the PCIe transfer cost between system RAM and VRAM. Net win on workloads where data movement matters more than peak FLOPS.

MoE wins: sparse expert activation means only the active experts’ weights move through the memory bus per token. Unified memory’s lower bandwidth is not the bottleneck.

Dense models lose: every parameter activates on every token, so the entire model moves through the memory bus per token. Unified memory’s lower bandwidth becomes the bottleneck.

The number that decides: active parameters per token, not total parameters. A 119B-active dense model and a 119B-total / 17B-active MoE are not the same workload on the same hardware.

The implication: the headline “X gigabytes of memory” is the wrong way to read the spec sheet. Read the bandwidth column and the architecture-fit column.

The architectural picture

Discrete-GPU inference is a two-pool memory architecture. System RAM holds program state and the inactive model. GPU VRAM holds the active model and the running computation. Data moves between the pools through PCIe, which is slower than either pool’s internal bandwidth. The optimization target is to keep the active working set resident in VRAM and minimize PCIe traffic.

Unified-memory inference is a one-pool architecture. The same physical memory is addressable by the CPU and the GPU. There is no PCIe transfer. The model is loaded once into the unified pool and the GPU reads it directly. The optimization target is different: there is no separate “VRAM budget” to manage, just the total memory budget for the whole system.

The performance implication of the one-pool architecture depends on the workload. If the workload’s bottleneck is data movement between pools (which happens when models do not fit in VRAM and must be swapped), unified memory wins because the swap cost is zero. If the workload’s bottleneck is peak bandwidth during compute (which happens when models do fit in VRAM but the GPU is bandwidth-bound on every token), unified memory loses because the unified-pool bandwidth is lower than discrete-GPU HBM.
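The two branches can be put side by side with a deliberately crude model of per-token weight reads. This is a sketch, not a benchmark: the function names are made up for illustration, the PCIe figure is an assumption (roughly PCIe 4.0 x16) rather than a number from this article, and the model assumes any weights that overflow VRAM are streamed across PCIe on every token.

```python
# Crude per-token weight-read model. Sizes in GB, bandwidths in GB/s.
PCIE_GBS = 32.0  # assumption: roughly PCIe 4.0 x16, not a figure from the article


def discrete_seconds_per_token(weights_gb, vram_gb, vram_bw_gbs, pcie_bw_gbs=PCIE_GBS):
    """Weights resident in VRAM are read at VRAM bandwidth; any overflow is
    assumed to be streamed across PCIe on every token."""
    resident_gb = min(weights_gb, vram_gb)
    overflow_gb = max(weights_gb - vram_gb, 0.0)
    return resident_gb / vram_bw_gbs + overflow_gb / pcie_bw_gbs


def unified_seconds_per_token(weights_gb, pool_bw_gbs):
    """One pool: every weight is read at the pool's bandwidth, with no swap term."""
    return weights_gb / pool_bw_gbs


# Case 1: 17.5 GB of dense weights fits in a 24 GB card.
fits_discrete = discrete_seconds_per_token(17.5, vram_gb=24.0, vram_bw_gbs=936.0)
fits_unified = unified_seconds_per_token(17.5, pool_bw_gbs=273.0)

# Case 2: 35 GB of dense weights overflows the same 24 GB card.
spill_discrete = discrete_seconds_per_token(35.0, vram_gb=24.0, vram_bw_gbs=936.0)
spill_unified = unified_seconds_per_token(35.0, pool_bw_gbs=273.0)

print(f"fits:   discrete {fits_discrete:.3f} s/token, unified {fits_unified:.3f} s/token")
print(f"spills: discrete {spill_discrete:.3f} s/token, unified {spill_unified:.3f} s/token")
```

Under these assumptions the discrete card is ahead while the weights fit, and the unified pool is ahead once the overflow term appears. Real offloading schemes behave differently in detail, so treat the output as a direction, not a prediction.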

The bandwidth column on the spec sheet

The DGX Spark’s GB10 silicon has a published memory bandwidth of about 273 GB/s (vendor claim). The Apple M3 Ultra’s is around 800 GB/s (vendor claim). Each RTX 3090 in a dual-card setup has roughly 936 GB/s. An H100 has roughly 3,350 GB/s.

The same numbers as a grid, with what each architecture is actually good for:

Hardware	Memory model	Bandwidth (vendor)	Wins on
DGX Spark (GB10)	Unified, 128 GB	~273 GB/s	100B+ MoE that exceeds any single consumer card’s VRAM
Apple M3 Ultra	Unified	~800 GB/s	unified workloads wanting more bandwidth than the Spark
Dual RTX 3090	Discrete, two-pool	~936 GB/s per card	dense 70B+ at heavy quant, spread across VRAM
H100	Discrete, two-pool	~3,350 GB/s	models that fit entirely in VRAM, raw throughput

These numbers are not directly comparable because the architectures are different. The Spark and the M3 Ultra are unified-memory, so the bandwidth is shared across the whole system; the 3090 and the H100 are discrete-GPU, so the bandwidth applies to VRAM only and any PCIe transfer is on top.

For inference workloads where the model fits entirely in the discrete GPU’s VRAM, the discrete card’s higher bandwidth wins on raw token throughput. For inference workloads that exceed a single discrete card’s VRAM, the unified-memory architecture wins by not requiring the swap.

The Spark is the clearest demonstration of this trade because the unified memory is 128 GB, which fits language models that no single 24 GB discrete card can hold. The architecture is the right shape for “I want to run a 100B+ MoE model on a single box.”
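The capacity side of the trade is plain arithmetic. A minimal sketch, assuming weights dominate the footprint and using hypothetical helper names; the `headroom_gb` parameter is an illustrative allowance for KV cache and runtime overhead, not a measured value.

```python
def weight_footprint_gb(total_params_billion, bits_per_param):
    """Approximate weight size in GB: parameters x bits / 8."""
    return total_params_billion * bits_per_param / 8


def fits_in_memory(total_params_billion, bits_per_param, memory_gb, headroom_gb=0.0):
    """True if the weights plus a caller-chosen headroom (hypothetical allowance
    for KV cache and runtime) fit in the given memory."""
    needed_gb = weight_footprint_gb(total_params_billion, bits_per_param) + headroom_gb
    return needed_gb <= memory_gb


for total_b in (35, 70, 119):
    size_gb = weight_footprint_gb(total_b, bits_per_param=4)
    print(
        f"{total_b}B at 4-bit: {size_gb:.1f} GB, "
        f"24 GB card: {fits_in_memory(total_b, 4, 24)}, "
        f"128 GB pool: {fits_in_memory(total_b, 4, 128)}"
    )
```

Passing a non-zero `headroom_gb` is the quick way to see how much room is left for long contexts before a model stops fitting.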

Why MoE wins on unified memory

Mixture-of-experts models route each token through a small subset of the total parameters. A model with 35B total parameters and 3B active per token (Qwen 3.6 PrismaQuant) moves only the 3B active parameters’ weight through the memory bus per token, not the full 35B.

On unified memory, the per-token data movement is 3B parameters x precision-bits / 8, or about 1.5 GB at INT4. On a discrete GPU with the same model, the same data movement happens, but the model has to fit in VRAM first (which a 35B model at INT4 quantization does, comfortably).

The interesting case is the 100B+ MoE class. A 119B-total / 17B-active MoE model needs roughly 60 GB for its weights at INT4 quantization (119B parameters x 4 bits / 8), which exceeds a single 24 GB GPU’s VRAM. On a single-discrete-GPU build, you cannot run this model. On unified memory with 128 GB, you can, and the per-token movement is only 17B-active x precision, which the unified-memory bandwidth handles at acceptable throughput.

This is the architecture-fit win: MoE 100B+ runs on unified memory in a way that discrete-GPU consumer hardware in the same price tier cannot match.
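The gap between what an MoE model occupies and what it moves per token is easy to make concrete. A rough sketch, assuming 4-bit weights and ignoring KV cache and activation traffic; `moe_traffic` is a hypothetical helper, and the two shapes are the 35B / 3B-active and 119B / 17B-active examples used in this section.

```python
def gb_at_precision(params_billion, bits_per_param):
    """Parameters (in billions) x bits / 8 gives gigabytes."""
    return params_billion * bits_per_param / 8


def moe_traffic(total_params_billion, active_params_billion, bits_per_param=4):
    """Resident weight size versus weight bytes read per generated token."""
    resident_gb = gb_at_precision(total_params_billion, bits_per_param)
    moved_gb = gb_at_precision(active_params_billion, bits_per_param)
    return {
        "resident_gb": resident_gb,
        "moved_gb_per_token": moved_gb,
        "moved_fraction": moved_gb / resident_gb,
    }


for total_b, active_b in ((35, 3), (119, 17)):
    t = moe_traffic(total_b, active_b)
    print(
        f"{total_b}B total / {active_b}B active: "
        f"{t['resident_gb']:.1f} GB resident, "
        f"{t['moved_gb_per_token']:.1f} GB moved per token "
        f"({t['moved_fraction']:.0%} of the weights)"
    )
```

The resident figure decides whether the model fits at all; the moved figure is what the memory bus has to deliver on every token.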

Why dense models lose on unified memory

Dense models activate every parameter on every token. A 70B dense model in INT4 quantization is roughly 35 GB on disk, moves 35 GB through the memory bus per token (modulo cache effects), and demands as much memory bandwidth as the hardware can provide.

On a dual-3090 build with NVLink, the 35 GB does not fit in one card’s 24 GB VRAM but does fit across two cards with NVLink bridging. The aggregate bandwidth is the per-card bandwidth (~936 GB/s) modulo the NVLink overhead. The model runs.

On unified memory, the 35 GB does fit (the Spark’s 128 GB is plenty), but the bandwidth (~273 GB/s) becomes the bottleneck. The dense model runs, but at lower throughput than the discrete-GPU setup with comparable raw FLOPS.

This is the architecture-fit loss: dense 70B+ runs faster on a discrete-GPU build than on unified memory in the same price tier.
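For the dense case, a bandwidth-only ceiling shows the shape of the loss. This is a sketch under strong assumptions: decode is treated as purely bandwidth-bound, all 35 GB of weights are read once per token, and NVLink overhead, cache effects, and compute limits are ignored. The function name is illustrative and the bandwidth figures are the vendor numbers quoted above.

```python
BANDWIDTH_GBS = {
    "DGX Spark (GB10)": 273.0,
    "RTX 3090 (per card)": 936.0,
}


def decode_ceiling_tok_s(weights_gb_per_token, bandwidth_gbs):
    """Upper bound on tokens/second if reading weights were the only cost."""
    return bandwidth_gbs / weights_gb_per_token


dense_70b_int4_gb = 35.0  # 70B parameters x 4 bits / 8

ceilings = {
    name: decode_ceiling_tok_s(dense_70b_int4_gb, bw)
    for name, bw in BANDWIDTH_GBS.items()
}

for name, tok_s in ceilings.items():
    print(f"{name}: at most {tok_s:.1f} tok/s on dense 70B at 4-bit")
```

These are ceilings, not predictions. Measured throughput sits below them, and the measured gap between the two builds can be much narrower than the raw bandwidth ratio suggests.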

The active-parameters-per-token number

The single number that decides the architecture-fit question is “active parameters per token.” This is not the total parameter count. It is the count of parameters that participate in computing each output token.

For dense models, active = total. A 70B dense is 70B active.

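For MoE models with k experts active out of N total, active is roughly the always-on weights (attention, embeddings, any shared experts) plus (k/N) x the routed-expert weights, so it is somewhat larger than (k/N) x total. In practice, use the active count the model card publishes. A 119B MoE with 17B active per token is 17B active.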

For sparse-activation specialized architectures (mixture of depths, attention-sparse variants), the math varies but the principle is the same: count the parameters that the forward pass actually moves.
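Counting active parameters for an MoE can be sketched as follows. The k/N ratio applies to the routed expert weights; weights that every token uses (attention, embeddings, any shared experts) are added on top. The configuration below is entirely hypothetical, chosen only so the totals land on 119B and 17B; it does not describe any real model.

```python
def moe_param_counts(always_on_billion, per_expert_billion, n_experts, k_active):
    """Return (total, active) parameter counts in billions for a simple MoE."""
    total = always_on_billion + n_experts * per_expert_billion
    active = always_on_billion + k_active * per_expert_billion
    return total, active


# Hypothetical decomposition, not a real model's configuration:
# 10.2B always-on weights, 64 routed experts of 1.7B each, 4 routed per token.
total_b, active_b = moe_param_counts(
    always_on_billion=10.2, per_expert_billion=1.7, n_experts=64, k_active=4
)

# Applying k/N to the whole model ignores the always-on weights.
ratio_only_b = (4 / 64) * total_b

print(f"total:  {total_b:.1f}B")
print(f"active: {active_b:.1f}B")
print(f"k/N x total: {ratio_only_b:.1f}B")
```

The ratio-only figure comes out lower than the active count because it treats the always-on weights as if they were routed too.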

The architecture-fit rule of thumb: if (active parameters x precision-bits / 8) < (hardware memory bandwidth / desired tokens-per-second), the hardware can sustain that throughput. The math is rough but produces correct first-cut decisions about which architecture (unified or discrete) matches which workload.
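The rule of thumb translates directly into a one-line check. A minimal sketch: `can_sustain` is a hypothetical helper, the 20 tok/s target is an arbitrary illustrative choice, and the bandwidth figures are the vendor numbers quoted earlier.

```python
def can_sustain(active_params_billion, bits_per_param, bandwidth_gbs, target_tok_s):
    """Rule of thumb: bytes read per token must be below the per-token
    bandwidth budget (bandwidth / desired tokens per second)."""
    gb_per_token = active_params_billion * bits_per_param / 8
    budget_gb_per_token = bandwidth_gbs / target_tok_s
    return gb_per_token < budget_gb_per_token


TARGET_TOK_S = 20  # illustrative target, not a figure from the article

cases = [
    ("3B-active MoE on 273 GB/s", 3, 273.0),
    ("17B-active MoE on 273 GB/s", 17, 273.0),
    ("70B dense on 273 GB/s", 70, 273.0),
    ("70B dense on 936 GB/s", 70, 936.0),
]

for label, active_b, bw in cases:
    ok = can_sustain(active_b, bits_per_param=4, bandwidth_gbs=bw, target_tok_s=TARGET_TOK_S)
    print(f"{label}: {'ok' if ok else 'bandwidth-limited'} at {TARGET_TOK_S} tok/s")
```

The check only accounts for weight reads, so a pass means the hardware is not ruled out, not that the target is guaranteed.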

Practical implications for the Spark

The Spark is the right hardware for workloads where active parameters per token are small relative to total parameters. Qwen 3.6 PrismaQuant (3B active / 35B total) is the canonical case. Qwen-Coder-Next (3B active / 80B total) is similar. Future MoE models with active counts in the 10-30B range and total counts up to roughly 200B are also good fits.

The Spark is the wrong hardware for workloads where active equals total and total is large. Mistral Small 4 NVFP4 is 119B dense; the Spark runs it at roughly 29-36 tok/s depending on EAGLE configuration. That is good throughput in absolute terms, but the architecture is fighting the bandwidth ceiling. A dual-3090 build at the same price runs slightly faster on dense 70B at heavy quant. For the dense Llama 3 70B class, the dual-3090 is the better answer.

The Spark is the wrong hardware for diffusion models, which are FLOPS-bound rather than bandwidth-bound. The Spark’s compute is good but not class-leading; a discrete GPU at the same price gives more FLOPS per dollar for diffusion workloads. (See DGX Spark vs Apple M3 Ultra Mac Studio for the broader comparison.)

Where this fits

For the practical implications, see Should You Buy a DGX Spark in 2026? and Mistral vs Qwen 3.6 vs GLM-5 on a Single DGX Spark. For the quantization layer that interacts with the memory model, see NVFP4 Quantization Explained. For why owning this layer is worth the bandwidth penalty in the first place, see the week a cloud vendor’s models were revoked for an entire continent: the same local box that loses on peak FLOPS keeps serving on the day the cloud option disappears.

Read the next mental-model piece

The follow-up on the inference-engine kernel selection logic, which is the layer that translates the mental model above into actual throughput, is the EAGLE Speculative Decoding article (forthcoming). Read it next.
