
TFLOPS: a compute-rate spec, not a speed promise

TFLOPS (tera floating-point operations per second) is a measure of how many trillion floating-point calculations a processor can perform each second. It is a peak compute-rate spec from the data sheet, not a measured workload result, and it rarely predicts the real tokens-per-second of a running model.

At a glance

What it is: Trillions of floating-point operations per second a chip can do
What it measures: Peak compute rate on the spec sheet, not a real workload
Why it misleads: Token speed is usually set by memory bandwidth, not compute
How to read it: A rough ceiling, never a promise of tokens per second

Why TFLOPS is not tokens per second

A high compute rate is a ceiling, not a guarantee. For most local language-model work the real limit sits a step earlier, at how fast weights stream out of memory.

1. TFLOPS on the spec sheet: peak compute, measured in ideal conditions
2. Memory bandwidth: for most token generation this is the real bottleneck
3. Tokens per second you measure: the number that actually matters, set by the slowest step

What does TFLOPS measure?

TFLOPS stands for tera floating-point operations per second: trillions of floating-point calculations per second. It is a peak figure from the data sheet, the most arithmetic a chip could do in a second under ideal conditions. As a rough measure of raw compute, a higher figure is broadly better, and it is a fair basis for comparing chips on compute-heavy work like training or image generation.

Why does it rarely predict tokens per second?

Here is the trap. For most local language-model work, the chip is not waiting on maths. It is waiting on memory. To produce each token it has to read the model’s weights out of memory, and on typical hardware that read is slower than the arithmetic that follows. The workload is memory-bound, not compute-bound. So a chip with a high TFLOPS figure can still produce tokens slowly, because the bottleneck sits a step earlier, at memory bandwidth.
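To make the memory-bound argument concrete, here is a rough back-of-envelope sketch in Python. It assumes a dense model generating one token at a time, that every weight is read from memory once per token, and that each parameter costs about two floating-point operations (a multiply and an add) per token. The function name and all the hardware numbers are hypothetical and chosen only for illustration, not taken from any data sheet.

```python
def decode_ceilings(params_billion, bytes_per_param, bandwidth_gb_s, peak_tflops):
    """Return (memory_ceiling, compute_ceiling) in tokens per second.

    Rough model, under stated assumptions: dense weights, one token
    generated at a time, every weight read from memory once per token.
    """
    params = params_billion * 1e9
    weight_bytes = params * bytes_per_param

    # Memory ceiling: how many full passes over the weights fit in one second.
    memory_ceiling = bandwidth_gb_s * 1e9 / weight_bytes

    # Compute ceiling: assume ~2 floating-point operations per parameter per token.
    compute_ceiling = peak_tflops * 1e12 / (2 * params)

    return memory_ceiling, compute_ceiling


# Hypothetical chip and model: 7 billion parameters at 2 bytes each,
# 500 GB/s of memory bandwidth, 50 TFLOPS peak compute.
mem_tps, cmp_tps = decode_ceilings(
    params_billion=7, bytes_per_param=2, bandwidth_gb_s=500, peak_tflops=50
)
print(f"memory ceiling:  {mem_tps:.1f} tokens/s")
print(f"compute ceiling: {cmp_tps:.1f} tokens/s")
print(f"limiting step:   {'memory' if mem_tps < cmp_tps else 'compute'}")
```

With these made-up numbers the memory ceiling works out to roughly 36 tokens per second and the compute ceiling to roughly 3,570, so the arithmetic units would sit mostly idle while the weights stream in.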

This is why you should treat a TFLOPS number as a ceiling, not a forecast. It tells you what the chip could do if compute were the limit. It does not tell you that compute is the limit for your model. The only honest way to learn your real tokens per second is to run your model and measure it. The spec sheet sets an upper bound and stops there.
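Measuring can be as simple as timing a generation call. The sketch below is a minimal harness, assuming you have some function that takes a prompt and a token budget and returns the generated tokens. Both `measure_tokens_per_second` and `fake_generate` are hypothetical names for illustration; `fake_generate` is a stand-in that sleeps instead of running a model, so swap in the real call from whatever runtime you use.

```python
import time


def measure_tokens_per_second(generate, prompt, max_new_tokens):
    """Time one generation call and return tokens produced per second."""
    start = time.perf_counter()
    tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed


def fake_generate(prompt, max_new_tokens):
    """Stand-in for a real model call (assumption: returns a list of tokens)."""
    tokens = []
    for i in range(max_new_tokens):
        time.sleep(0.001)  # pretend each token takes about a millisecond
        tokens.append(i)
    return tokens


tps = measure_tokens_per_second(fake_generate, "hello", max_new_tokens=20)
print(f"measured: {tps:.0f} tokens/s")
```

In practice it helps to run a short warm-up first and average several runs, since the first call often includes one-off loading costs.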

TFLOPS is useful for

  • A rough sense of a chip's peak arithmetic ceiling
  • Comparing chips for compute-heavy work like training or image generation
  • Spotting a generational jump in raw maths capability

TFLOPS will not tell you

  • How many tokens per second a given model will actually produce
  • Whether your workload is compute-bound or memory-bound (usually the latter)
  • Anything about memory capacity, which caps what you can load at all
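The capacity point can be checked with simple arithmetic before looking at any compute figure. The sketch below estimates the weight footprint only, assuming a fixed number of bytes per parameter; it ignores the extra memory a running model needs for its cache and activations, so treat a "fits" answer as optimistic. The function names and the example sizes are hypothetical.

```python
def weight_footprint_gb(params_billion, bytes_per_param):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def weights_fit(params_billion, bytes_per_param, memory_gb):
    """True if the weights alone fit in the given memory (ignores runtime overhead)."""
    return weight_footprint_gb(params_billion, bytes_per_param) <= memory_gb


# Hypothetical example: a 7-billion-parameter model on a device with 8 GB of memory.
print(weights_fit(7, 2.0, 8))  # 2 bytes per parameter: 14 GB of weights
print(weights_fit(7, 0.5, 8))  # 4 bits per parameter: 3.5 GB of weights
```

No TFLOPS figure changes either answer: if the weights do not fit, the model does not load.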

Reviewed: June 2026