Tensor parallelism is a way of running one model across several GPUs (graphics processing units) by splitting each layer's math across them, so the GPUs compute a single forward pass together and exchange results as they go. It lets a model too large for one GPU's memory run on several; the price is that the GPUs need a fast link to communicate over.
At a glance
What it is: Splitting each layer's computation across multiple GPUs
What it solves: A model that does not fit in one GPU's memory
The cost: GPUs must exchange data every layer, so the link between them matters
On one GPU: The tensor-parallel degree is set to one; there is nothing to split across
Flow
How tensor parallelism runs one pass
Each layer's work is split across the GPUs, each computes its slice, then they exchange results before the next layer. The exchange is why a fast link between GPUs matters.
1. Split each layer across the GPUs: every GPU holds one slice of the layer
2. Each GPU computes its slice: the heavy math runs in parallel
3. Exchange results, move to the next layer: the link speed sets the floor
What is tensor parallelism?
When a model is too big to fit in one GPU (graphics processing unit), you can
spread it across several. Tensor parallelism is one way to do that: instead of
giving each GPU whole layers, it splits the math inside every layer across all of
them. Each GPU holds a slice of the layer, computes its part of the forward pass,
and then the GPUs exchange results before moving to the next layer. From the
outside it looks like one model; underneath, several GPUs are doing one pass
together and pooling their memory into a single budget.
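The split, compute, exchange cycle can be sketched in plain Python with no GPUs at all. This is a minimal simulation, not how any particular framework implements it: the "GPUs" are just list slices, the layer is a single matrix multiply, and the helper names (matmul, split_columns, tensor_parallel_layer) are made up for illustration. It assumes the weight matrix is split by columns, which is one of several possible ways to slice a layer.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w, num_gpus):
    """Give each simulated GPU an equal, contiguous block of w's columns."""
    cols = len(w[0])
    assert cols % num_gpus == 0, "columns must divide evenly across GPUs"
    width = cols // num_gpus
    return [[row[g * width:(g + 1) * width] for row in w] for g in range(num_gpus)]


def tensor_parallel_layer(x, w, num_gpus):
    """One layer: split the weights, compute each slice, gather the pieces."""
    shards = split_columns(w, num_gpus)          # each GPU holds one slice
    partials = [matmul(x, s) for s in shards]    # each GPU computes its slice
    # The exchange step: concatenate the per-GPU outputs in GPU order.
    return [value for part in partials for value in part]


x = [1.0, 2.0, 3.0]
w = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
]
single_gpu = matmul(x, w)
two_gpus = tensor_parallel_layer(x, w, num_gpus=2)
print(single_gpu == two_gpus)  # the split version gives the same answer
```

Each simulated GPU only ever touches its own columns of w, which is where the memory saving comes from. The concatenation on the last line of tensor_parallel_layer stands in for the data the real GPUs would have to send over the link.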
What does it cost, and when do you use it?
The cost is communication. Because the GPUs trade data at every layer, the link
between them sets a floor on how fast the whole thing can go. A fast interconnect
makes tensor parallelism nearly free; a slow one makes it a tax. The rule of
thumb: reach for it when a model genuinely does not fit on one GPU and you have
several with a good link between them. On a single-GPU box the tensor-parallel
degree is just one, and setting it higher than the hardware supports is a common
way to walk straight into an out-of-memory error. It is a capacity tool, not a
free speed-up.
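To see why the link sets the floor, here is a deliberately crude back-of-the-envelope model. Everything in it is an assumption for illustration: the function name, the idea that compute divides evenly by the tensor-parallel degree, one exchange per layer, and all of the numbers. Real systems also have per-message latency and can overlap communication with compute, which this sketch ignores.

```python
def layer_time_ms(compute_ms, tp_degree, exchange_mb, link_gb_per_s):
    """Rough time for one layer: the split compute plus one exchange."""
    compute = compute_ms / tp_degree
    if tp_degree == 1:
        return compute  # single GPU: nothing to exchange
    # Moving 1 MB over a 1 GB/s link takes 1 ms (using 1 GB = 1000 MB).
    exchange = exchange_mb / link_gb_per_s
    return compute + exchange


# Made-up numbers: 4 ms of compute per layer, 16 MB exchanged per layer.
single = layer_time_ms(4.0, tp_degree=1, exchange_mb=16, link_gb_per_s=400)
fast = layer_time_ms(4.0, tp_degree=4, exchange_mb=16, link_gb_per_s=400)
slow = layer_time_ms(4.0, tp_degree=4, exchange_mb=16, link_gb_per_s=4)

print(f"one GPU:           {single:.2f} ms per layer")
print(f"4 GPUs, fast link: {fast:.2f} ms per layer")
print(f"4 GPUs, slow link: {slow:.2f} ms per layer")
```

With these invented figures the slow-link case comes out slower than a single GPU, because the exchange takes longer than the compute it saved. The exact crossover depends entirely on the real hardware and model, so treat the output as a shape, not a measurement.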
Tensor parallelism helps when
A model is too big for one GPU's memory
You have several GPUs with a fast link between them
You want to use the combined memory as one budget
It is the wrong tool when
The model already fits on one GPU; the split just adds overhead
The link between GPUs is slow, so the exchanges dominate
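The capacity side of that checklist can be written as a small helper. This is a hypothetical sketch, not any tool's actual logic: smallest_fitting_degree is an invented name, it only counts the model's weights against GPU memory, and it ignores activations, caches, and any framework rule about which degrees are allowed.

```python
def smallest_fitting_degree(model_gb, gpu_memory_gb, num_gpus):
    """Smallest tensor-parallel degree whose per-GPU slice fits, or None."""
    for degree in range(1, num_gpus + 1):
        if model_gb / degree <= gpu_memory_gb:
            return degree
    return None  # does not fit even when split across every GPU


# Made-up sizes for illustration.
print(smallest_fitting_degree(model_gb=40, gpu_memory_gb=80, num_gpus=4))
print(smallest_fitting_degree(model_gb=140, gpu_memory_gb=80, num_gpus=4))
print(smallest_fitting_degree(model_gb=400, gpu_memory_gb=80, num_gpus=4))
```

A result of 1 is the "already fits" case where splitting would only add overhead, and None means the combined memory budget is still too small. The helper says nothing about the link, so a degree above 1 is only worth using if the GPUs are also well connected.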