A3B is a naming suffix on mixture-of-experts models that means roughly three billion active parameters per token. The model stores many more parameters in total, but for any single token it routes through only a small fraction. A 30-billion-total model marked A3B activates about 3B of those per token, so its speed and per-token compute track the 3B, not the 30B.
At a glance
What it means
About three billion active parameters per token
What it does not mean
The total parameter count, which is much larger
Why the split exists
It is a mixture-of-experts model, routing each token to a few experts
What it predicts
Per-token speed and compute track the active count, not the total
Comparison
Total parameters versus active parameters
Total (stored)
Active (per token, A3B)
What it counts
Every weight in the model
Only the weights one token routes through
Sets memory needed
Yes, all weights must be loaded
No, the whole model still loads
Sets per-token speed
No, most weights are idle per token
Yes, this is the work done per token
What does A3B in a model name mean?
A3B reads as “active 3B”, meaning about three billion active parameters per
token. It appears on mixture-of-experts (MoE) models, which split their weights
into many “experts” and route each token through only a few of them. So a model
named with a 30B total and an A3B suffix holds 30 billion parameters on disk and
in memory, but any one token only touches around 3 billion of them. The big
number is storage; the small number is the work done for each token.
Why is the active count the one to watch?
Because per-token speed and per-token reasoning track the active parameters, not
the total. The model still has to load all its weights, so memory needs follow
the total. But the arithmetic done to produce each token follows the active
count, which is why an A3B model can run far faster than its total size suggests.
It also tempers the instinct that a bigger total always wins: the extra weight
is stored knowledge sitting in idle experts, not more thinking per token. When
you compare two models, line up their active counts before you read anything into
the totals.
A3B tells you
Roughly how much compute runs per token: about 3B parameters' worth
That it is a mixture-of-experts model, not a dense one
Why it can be fast for its total size
A3B does not tell you
How much memory to load; that is the total parameter count
How capable it is overall; stored knowledge lives in the idle experts too
That a bigger total beats it; active count drives per-token reasoning