MoE: many small experts, one bill at a time

A mixture of experts (MoE) is a model architecture that splits the network into many specialist sub-networks, the experts, and routes each token to only a small subset of them. The model has a large total parameter count but activates only a fraction per token, so it runs far cheaper than its size suggests.

What is a mixture of experts?

A dense model runs every parameter for every token. A mixture of experts (MoE) splits the network into many specialist sub-networks, called experts, and adds a small router that picks only a few of them for each token. A model can therefore have a very large total parameter count while only a small active fraction does work on any given token. You get something close to the quality of a huge dense model while paying, in compute, for a much smaller one. That is the whole pitch, and it mostly holds up.
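The routing step described above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: it assumes top-k selection with a softmax over the chosen scores, which is one common way to build a router, and the names `top_k_route`, `moe_layer`, the toy experts, and the random router weights are all hypothetical.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router, k=2):
    """Run only the k selected experts on x and mix their outputs."""
    routed = top_k_route(router(x), k)
    out = [0.0] * len(x)
    for idx, weight in routed:
        y = experts[idx](x)  # the other experts are never called for this token
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in routed]


# Toy setup (assumption): 8 experts that just scale their input,
# and a router that is a random linear projection of the token vector.
NUM_EXPERTS = 8
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]
rng = random.Random(0)
router_weights = [[rng.uniform(-1, 1) for _ in range(4)]
                  for _ in range(NUM_EXPERTS)]


def router(x):
    return [sum(w * v for w, v in zip(row, x)) for row in router_weights]


output, used = moe_layer([0.5, -1.0, 2.0, 0.1], experts, router, k=2)
print("experts used for this token:", used)
```

All eight toy experts exist in the `experts` list, but each call to `moe_layer` executes only two of them. That gap between what is held and what is run is the point of the architecture.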

Where is the catch?

In memory. The router runs only a few experts per token, but it could pick any of them for the next token, so all the experts have to be loaded and waiting. Your compute bill scales with the active count; your memory bill scales with the total. On a unified-memory box like the DGX Spark, where the model shares one pool with the operating system, that total is what decides whether the model fits at all. A mixture of experts is cheaper to run than its size suggests and exactly as expensive to hold. Plan the fit against total parameters; plan the speed against active ones.
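The two bills can be put side by side with some rough arithmetic. The sketch below is an estimate under stated assumptions, not a sizing tool: it counts weights only (no KV cache, activations, or operating-system overhead), uses the common rule of thumb of about 2 FLOPs per active parameter per token, and the function name `moe_budget` and the example parameter counts are hypothetical.

```python
def moe_budget(total_params, active_params, bytes_per_param=2.0):
    """Rough weight-memory and per-token compute estimate for an MoE model.

    Assumptions: weights only, and ~2 FLOPs per active parameter per token
    (one multiply plus one add). bytes_per_param=2.0 corresponds to 16-bit
    weights.
    """
    weight_bytes = total_params * bytes_per_param  # memory follows the TOTAL
    flops_per_token = 2 * active_params            # compute follows the ACTIVE
    return {
        "weight_gib": weight_bytes / 2**30,
        "gflops_per_token": flops_per_token / 1e9,
        "active_fraction": active_params / total_params,
    }


# Hypothetical model: 100B total parameters, 10B active per token.
moe = moe_budget(total_params=100e9, active_params=10e9)

# A dense model with the same per-token compute holds only 10B parameters.
dense = moe_budget(total_params=10e9, active_params=10e9)

print(f"MoE:   {moe['weight_gib']:.1f} GiB of weights, "
      f"{moe['gflops_per_token']:.0f} GFLOPs per token")
print(f"Dense: {dense['weight_gib']:.1f} GiB of weights, "
      f"{dense['gflops_per_token']:.0f} GFLOPs per token")
```

In this made-up case the two models cost the same to run per token, but the MoE needs ten times the memory for its weights. Lowering `bytes_per_param` (for example, through quantization) shrinks the memory figure without changing the ratio between the two models.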

