CUDA (Compute Unified Device Architecture) is NVIDIA's platform and programming interface for running general-purpose software on its graphics processors. Code written for it executes on the thousands of small cores a graphics processor has, instead of on the handful of cores in a central processor. Nearly every engine that serves a local large language model today, from the model framework down to the math libraries, is built on CUDA, so it is the layer your inference actually runs on.
At a glance
What it is
NVIDIA's platform for running general-purpose software on its graphics processors
Why it matters
Almost all local large-language-model inference is built on top of it
Who controls it
NVIDIA alone; it runs on NVIDIA hardware and nobody else's
The catch
It is a single-vendor dependency you cannot self-host your way out of
Stack
Where CUDA sits in a local inference stack
Your sovereign choices live at the top. CUDA and the silicon under it are the one layer you do not get to swap, which is why it is the substrate the rest of the stack rests on.
3
NVIDIA graphics processorthe silicon the whole stack runs on
2
CUDA platform and math librariesNVIDIA's layer that turns that code into work the chip can do
1
Your model and serving enginethe part you choose, host, and control
Why is CUDA the layer everything runs on?
A central processor runs a few powerful cores, one task after another. A graphics
processor runs thousands of small cores at once, which is exactly the shape of the
math a language model needs. CUDA is what lets ordinary software reach those cores.
Without it, a graphics processor only draws pictures; with it, the same chip becomes
a general-purpose machine for the matrix math behind every token.
That is why CUDA sits under your whole stack. The serving engine you picked, the
model framework it uses, and the attention and math libraries below that are all
written against CUDA. You can swap the model, swap the engine, change the
quantization, and tune the context length. The layer that turns all of it into work
the chip can do stays the same. When local inference “just works” on an NVIDIA box,
this is the reason, and it is also the reason the same code will not run on hardware
from another vendor.
The honest sovereignty angle
Self-hosting buys back a great deal. You hold the weights, the prompts, the logs,
and the network path. Nothing leaves your machine that you did not send. But CUDA is
the one piece you do not get to bring home. It is NVIDIA’s, it runs on NVIDIA
silicon, and there is no open replacement you can stand up on your own hardware to
take its place. So the picture is honest rather than tidy: you move the dependency,
you do not delete it. The model is yours, the box is yours, and the substrate they
both rest on still belongs to a single company. Worth knowing before you call the
stack fully sovereign, and worth weighing when you decide which hardware to buy in
the first place.
Check it yourself
nvcc --version
This reports the CUDA compiler version your toolkit ships. If the command is missing but nvidia-smi works, the driver is present without the full toolkit; nvidia-smi itself prints the maximum CUDA version the installed driver supports.
What CUDA gives you
A mature, fast path that every major inference engine targets first
Math and attention libraries tuned for the exact NVIDIA chip you own
Tooling and documentation that the rest of the ecosystem assumes you have
What it costs you
A hard tie to one vendor's hardware, with no self-hosted substitute
A layer you run but do not control, sitting under everything you do control
Migration friction if you ever want to leave NVIDIA, since the stack is built on it