Tensor parallelism: one layer split across GPUs

Tensor parallelism is a way of running one model across several GPUs (graphics processing units) by splitting each layer's math across them, so the GPUs compute a single forward pass together and exchange results as they go. It lets a model too large for one GPU's memory run on several, at the cost of needing a fast link between them.

What is tensor parallelism?

When a model is too big to fit on one GPU, you can spread it across several. Tensor parallelism is one way to do that: instead of giving each GPU whole layers, it splits the math inside every layer across all of them. Each GPU holds a slice of the layer, computes its part of the forward pass, and then the GPUs exchange results before moving to the next layer. From the outside it looks like one model; underneath, several GPUs are doing one pass together and pooling their memory into a single budget.
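The slice, compute, exchange pattern described above can be sketched in a few lines. This is a minimal simulation, not how any particular framework implements it: plain Python lists stand in for GPUs, the layer is assumed to be a single matrix multiply split by columns, and the helper names (`split_columns`, `all_gather_columns`, `tensor_parallel_linear`) are made up for illustration.

```python
# Simulated tensor parallelism for one linear layer y = x @ W.
# Plain Python lists stand in for GPUs; all names here are illustrative.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(a_row[t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]

def split_columns(w, num_shards):
    """Split the weight matrix column-wise into equal slices, one per simulated GPU."""
    n = len(w[0])
    assert n % num_shards == 0, "columns must divide evenly across shards"
    width = n // num_shards
    return [[row[s * width:(s + 1) * width] for row in w]
            for s in range(num_shards)]

def all_gather_columns(parts):
    """Stand-in for the exchange step: join each shard's output columns in order."""
    rows = len(parts[0])
    return [[value for part in parts for value in part[r]] for r in range(rows)]

def tensor_parallel_linear(x, w, num_shards):
    shards = split_columns(w, num_shards)              # each "GPU" holds one slice
    partials = [matmul(x, shard) for shard in shards]  # each computes its part
    return all_gather_columns(partials)                # results are exchanged

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]

full = matmul(x, w)                                   # one device, whole layer
sharded = tensor_parallel_linear(x, w, num_shards=2)  # two devices, half each
print(sharded == full)  # True
```

Each simulated device stores only half of the weight columns, yet the gathered output matches the single-device result, which is why the split is invisible from the outside.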

What does it cost, and when do you use it?

The cost is communication. Because the GPUs trade data at every layer, the link between them sets a floor on how fast the whole thing can go. With a fast interconnect the overhead is small; with a slow one it becomes a tax on every layer. The rule of thumb: reach for it when a model genuinely does not fit on one GPU and you have several with a good link between them. On a single-GPU box the tensor-parallel degree is just one, and setting it higher than the hardware supports is a common way to walk straight into an out-of-memory error. It is a capacity tool, not a free speed-up.
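The rule of thumb above can be turned into a rough back-of-envelope check. The sketch below is an estimate under loud assumptions: weights split evenly across GPUs, activations and caches ignored, and communication time taken as bytes moved divided by link bandwidth. The function names and every number (140 GB of weights, 80 GB GPUs, 32 MB exchanged per layer, the two link speeds) are hypothetical placeholders, not measurements.

```python
# Back-of-envelope checks for tensor parallelism. All figures are hypothetical.

GB = 10**9
MB = 10**6

def smallest_fitting_degree(weight_bytes, gpu_memory_bytes, num_gpus):
    """Smallest degree whose per-GPU weight slice fits, or None if none does.

    Assumes an even split of weights and ignores activations and caches.
    """
    for degree in range(1, num_gpus + 1):
        if weight_bytes / degree <= gpu_memory_bytes:
            return degree
    return None

def communication_floor_seconds(bytes_per_layer, num_layers, link_bytes_per_second):
    """Time spent moving data over the link for one pass, ignoring latency."""
    return bytes_per_layer * num_layers / link_bytes_per_second

weights = 140 * GB    # hypothetical model size
gpu_memory = 80 * GB  # hypothetical per-GPU memory

print(smallest_fitting_degree(weights, gpu_memory, num_gpus=1))  # None: does not fit
print(smallest_fitting_degree(weights, gpu_memory, num_gpus=4))  # 2

# Same traffic over a fast and a slow link (both speeds are made up).
fast = communication_floor_seconds(32 * MB, 80, 400 * GB)
slow = communication_floor_seconds(32 * MB, 80, 25 * GB)
print(slow > fast)  # True
```

In this toy setup a single GPU cannot hold the model at all, two GPUs can, and the slower link pays sixteen times the communication time for exactly the same pass.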
