A mixture of experts (MoE) is a model architecture that splits the network into many specialist sub-networks, the experts, and routes each token to only a small subset of them. The model has a large total parameter count but activates only a fraction per token, so it runs far cheaper than its size suggests.
At a glance
What it is: A model split into many experts, only a few used per token
Total vs active: Large total parameter count, small active count per token
Why it is used: Big-model quality at a fraction of the compute per token
The catch: All experts must sit in memory, even the unused ones
Flow
How a token moves through a mixture of experts
A router picks a small subset of experts for each token. Only the chosen ones do work, but every expert still has to be loaded in memory.
1. Token arrives: one piece of the input, on its way through the layer
2. Router picks a few experts: a small subset of the many available, chosen per token
3. Only those experts compute: the rest sit idle for this token, saving compute
4. All experts stay in memory: loaded whether used or not, so the memory bill is the full size
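The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names `route_token` and `moe_layer` are hypothetical, the experts are stand-in functions, the router scores are hard-coded rather than learned, and normalising the weights with a softmax over only the selected scores is an assumed (though common) design choice.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Assumption: weights are a softmax over the selected scores only, so they sum to 1.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token, experts, router_scores, top_k=2):
    """Run only the chosen experts and mix their outputs by router weight."""
    routes = route_token(router_scores, top_k)
    output = [0.0] * len(token)
    for index, weight in routes:
        expert_out = experts[index](token)  # unchosen experts are never called
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, [index for index, _ in routes]

# Four toy experts; each one just scales the token by a different factor.
# All four exist in the list (in memory) even though only two will run.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical router scores for one token; a real router would compute these.
mixed, used = moe_layer([1.0, -1.0], experts,
                        router_scores=[0.1, 2.0, 0.3, 2.0], top_k=2)
print(used, mixed)
```

Note that the `experts` list holds all four functions for the whole call, while the loop touches only the two the router chose. That is the compute saving and the memory cost in miniature.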
What is a mixture of experts?
A dense model runs every parameter for every token. A mixture of experts (MoE)
splits the network into many specialist sub-networks, called experts, and adds a
small router that picks only a few of them for each token. So a model might hold
a very large total parameter count, but only a small active fraction does work on
any given token. You get something close to the quality of a huge dense model
while paying, in compute, for a much smaller one. That is the whole pitch, and it
mostly holds up.
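To make the gap between total and active parameters concrete, here is a rough back-of-the-envelope sketch. The function name `moe_param_counts` and every number in it are hypothetical, chosen only for illustration. It assumes each layer has the same number of equally sized experts and lumps everything that runs for every token (attention, embeddings, the router itself) into one `shared_params` figure.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts,
                     experts_per_token, num_layers):
    """Return (total, active) parameter counts for a simplified MoE model.

    total  counts every expert in every layer, since all must be stored.
    active counts only the experts the router selects for a single token.
    """
    total = shared_params + num_layers * num_experts * params_per_expert
    active = shared_params + num_layers * experts_per_token * params_per_expert
    return total, active

# Hypothetical configuration, not taken from any real model.
total, active = moe_param_counts(
    shared_params=2_000_000_000,    # always-on parameters
    params_per_expert=100_000_000,  # size of one expert in one layer
    num_experts=64,                 # experts available per layer
    experts_per_token=4,            # experts the router picks per token
    num_layers=24,
)
print(f"total:  {total / 1e9:.1f}B parameters")
print(f"active: {active / 1e9:.1f}B parameters per token")
print(f"active fraction: {active / total:.1%}")
```

With these made-up numbers the model stores roughly 156 billion parameters but uses under 12 billion for any one token.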
Where is the catch?
In memory. The router selects only a few experts per token, but it could pick any
of them for the next token, so all the experts have to be loaded and waiting. Your compute
bill scales with the active count; your memory bill scales with the total. On a
unified-memory box like the DGX Spark, where the model shares one pool with the
operating system, that total is what decides whether the model fits at all. A
mixture of experts is cheaper to run than its size suggests and exactly as
expensive to hold. Plan the fit against total parameters, plan the speed against
active ones.
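The "plan the fit against total parameters" rule can be turned into a quick check. This is a simplified sketch: the helper names, the pool size, the reserved headroom and the parameter count are all hypothetical placeholders to replace with your own figures. It counts weights only and ignores per-format overhead, activations and caches beyond the single `reserved_gb` allowance.

```python
def weights_gb(param_count, bytes_per_param):
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return param_count * bytes_per_param / 1e9

def fits_in_pool(total_params, bytes_per_param, pool_gb, reserved_gb):
    """True if ALL weights fit after reserving room for the OS and runtime.

    The check deliberately uses total parameters, not active ones:
    every expert has to be resident, whether or not it is selected.
    """
    return weights_gb(total_params, bytes_per_param) <= pool_gb - reserved_gb

# Hypothetical model and machine; substitute your own numbers.
total_params = 120_000_000_000
pool_gb = 128      # assumed size of the shared memory pool
reserved_gb = 16   # assumed headroom for the OS, caches and activations

for label, bytes_per_param in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    size = weights_gb(total_params, bytes_per_param)
    verdict = "fits" if fits_in_pool(total_params, bytes_per_param,
                                     pool_gb, reserved_gb) else "does not fit"
    print(f"{label}: {size:.0f} GB of weights, {verdict}")
```

The active parameter count never appears in the fit check. It matters for estimating speed, not for deciding whether the model loads.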
MoE saves you
Compute per token, since only a few experts run each time
Speed, often faster than a dense model of the same total size
A path to large total capacity without dense-model running cost
MoE does not save you
Memory: every expert must be loaded, used or not
Simplicity: routing adds moving parts, and the load across experts can become imbalanced
Out-of-memory (OOM) risk: the total size still has to fit in your memory pool
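The routing imbalance mentioned in the list above is easy to picture with a small counting sketch. The function names and the max-over-mean ratio are illustrative choices, not a standard metric from any library, and the assignments are invented for the example.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert.

    assignments is a list with one entry per token, each entry being
    the list of expert indices the router chose for that token.
    """
    counts = Counter(e for token_experts in assignments for e in token_experts)
    return [counts.get(i, 0) for i in range(num_experts)]

def imbalance_ratio(load):
    """Busiest expert's load divided by the mean load (1.0 means perfectly even)."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean else 0.0

# Hypothetical routing decisions for four tokens, two experts each.
assignments = [[0, 1], [0, 2], [0, 1], [0, 3]]
load = expert_load(assignments, num_experts=4)
print(load, imbalance_ratio(load))
```

Here expert 0 receives every token while experts 2 and 3 are barely used, so one expert does twice the average share of the work.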