Tensor parallelism: one layer split across GPUs

Tensor parallelism is a way of running one model across several GPUs (graphics processing units) by splitting each layer's math across them, so the GPUs compute a single forward pass together and exchange results as they go. It lets a model too large for one GPU's memory run on several, at the cost of needing fast communication between them.

At a glance

What it is: Splitting each layer's computation across multiple GPUs
What it solves: A model that does not fit in one GPU's memory
The cost: GPUs must exchange data at every layer, so the link between them matters
On one GPU: Set the tensor-parallel degree to one; there is nothing to split across

How tensor parallelism runs one pass

Each layer's work is split across the GPUs: each one computes its slice, then they exchange results before the next layer starts. The exchange is why a fast link between GPUs matters.

1. Split each layer across the GPUs: every GPU holds one slice of the layer.
2. Each GPU computes its slice: the heavy math runs in parallel.
3. Exchange results, move to the next layer: the link speed sets the floor.
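
The three steps can be sketched in plain Python. This is a toy simulation, not how any particular framework implements it: the "GPUs" are just list entries, the layer is a single small weight matrix split by output columns, and concatenating the partial results stands in for the exchange (an all-gather on real hardware). The function names and the tiny matrix are made up for illustration.

```python
def matmul(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_gpus):
    """Step 1: give each simulated GPU an equal block of the layer's output columns."""
    d_out = len(w[0])
    if d_out % num_gpus != 0:
        raise ValueError("output width must divide evenly across the GPUs")
    width = d_out // num_gpus
    return [[row[g * width:(g + 1) * width] for row in w] for g in range(num_gpus)]


def tensor_parallel_layer(x, w, num_gpus):
    shards = split_columns(w, num_gpus)
    # Step 2: each GPU computes its slice (sequential here, parallel on real hardware).
    partials = [matmul(x, shard) for shard in shards]
    # Step 3: exchange results; concatenation stands in for an all-gather.
    return [value for part in partials for value in part]


x = [1.0, 2.0, 3.0]
w = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
]
full = matmul(x, w)                                # the layer on one GPU
sharded = tensor_parallel_layer(x, w, num_gpus=2)  # the same layer split in two
print(full, sharded)                               # both are [7.0, 5.0, 4.0, 10.0]
```

The split result matches the unsplit one exactly, which is the point: the split changes where the math runs, not what it computes. Each simulated GPU only ever held half of the weight matrix.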

What is tensor parallelism?

When a model is too big to fit in one GPU, you can spread it across several. Tensor parallelism is one way to do that: instead of giving each GPU whole layers, it splits the math inside every layer across all of them. Each GPU holds a slice of the layer, computes its part of the forward pass, and then the GPUs exchange results before moving to the next layer. From the outside it looks like one model; underneath, several GPUs are doing one pass together and pooling their memory into a single budget.
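
To make the pooled memory budget concrete, here is a rough back-of-the-envelope sketch. It counts only the model weights and assumes they split evenly across the GPUs; activations, caches, and framework overhead are ignored, so real requirements are higher. The function names, the 70-billion-parameter model, and the 80 GiB GPU are hypothetical values chosen for illustration.

```python
GIB = 2**30  # bytes in one gibibyte


def weight_gib_per_gpu(num_params, bytes_per_param, tp_degree):
    """Weight memory each GPU holds if the weights split evenly across tp_degree GPUs."""
    return num_params * bytes_per_param / tp_degree / GIB


def smallest_fitting_degree(num_params, bytes_per_param, gpu_mem_gib,
                            candidates=(1, 2, 4, 8)):
    """Return the smallest candidate degree whose per-GPU weights fit, or None."""
    for degree in candidates:
        if weight_gib_per_gpu(num_params, bytes_per_param, degree) <= gpu_mem_gib:
            return degree
    return None


# Hypothetical model: 70 billion parameters stored at 2 bytes each.
params = 70_000_000_000
print(round(weight_gib_per_gpu(params, 2, 1), 1))   # weights on one GPU, in GiB
print(smallest_fitting_degree(params, 2, gpu_mem_gib=80))
```

In this example the weights alone need about 130 GiB, which does not fit one 80 GiB GPU, so the smallest degree that fits is two. A model whose weights already fit returns a degree of one, meaning there is nothing to split.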

What does it cost, and when do you use it?

The cost is communication. Because the GPUs trade data at every layer, the link between them sets a floor on how fast the whole thing can go. A fast interconnect makes tensor parallelism nearly free; a slow one makes it a tax. The rule of thumb: reach for it when a model genuinely does not fit one GPU and you have several with a good link between them. On a single-GPU box the tensor-parallel degree is just one, and setting it higher than the hardware supports is a common way to walk straight into an out-of-memory error. It is a capacity tool, not a free speed-up.
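
A rough sketch of why the link sets the floor. It assumes one exchange per layer done as a ring all-reduce, where each GPU sends and receives about 2 × (N − 1) / N times the message size, and it ignores latency and any overlap with compute. The bandwidth figures and message size below are illustrative placeholders, not measurements of any real interconnect, and the function names are made up for this example.

```python
def allreduce_seconds(message_bytes, num_gpus, link_bytes_per_sec):
    """Estimate the time for one ring all-reduce, counting bandwidth only."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    if num_gpus == 1:
        return 0.0  # degree one: nothing to exchange
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / link_bytes_per_sec


def check_degree(tp_degree, gpus_available):
    """Reject a tensor-parallel degree the hardware cannot supply."""
    if tp_degree < 1 or tp_degree > gpus_available:
        raise ValueError(
            f"tensor-parallel degree {tp_degree} needs {tp_degree} GPUs, "
            f"but {gpus_available} are available"
        )
    return tp_degree


# Hypothetical exchange: 4096 tokens x 8192 hidden values x 2 bytes each.
message = 4096 * 8192 * 2
degree = check_degree(4, gpus_available=4)

fast = allreduce_seconds(message, degree, link_bytes_per_sec=400e9)  # placeholder fast link
slow = allreduce_seconds(message, degree, link_bytes_per_sec=20e9)   # placeholder slow link
print(f"fast link: {fast * 1e3:.3f} ms per exchange")
print(f"slow link: {slow * 1e3:.3f} ms per exchange")
```

The same exchange takes 20 times longer on the slower placeholder link, and that cost repeats at every layer. Checking the degree against the GPU count up front is a cheap way to catch a misconfiguration before the model starts loading.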

Tensor parallelism helps when

  • A model is too big for one GPU's memory
  • You have several GPUs with a fast link between them
  • You want to use the combined memory as one budget

It is the wrong tool when

  • The model already fits one GPU; the split just adds overhead
  • The link between GPUs is slow, so the exchanges dominate
  • You only have a single GPU; set the degree to one

Reviewed: June 2026