vLLM is an open-source inference server for large language models. It loads a model and exposes an API, then batches incoming requests and manages the key-value (KV) cache so a single GPU serves many callers efficiently. It is one of the serving layers you can put in front of a model on your own hardware.
At a glance
What it is: An inference server that runs a model behind an API
Why use it: It batches concurrent requests to keep the GPU busy
What it serves: An HTTP endpoint your applications call
Where it runs: On your own GPU, so the weights stay local
Flow
The inference server in the middle
Requests arrive; the server batches them together, manages the KV cache, runs the model, and streams tokens back. The weights stay on your hardware.
1. Application or agent: sends prompts over HTTP
2. vLLM (the inference server): batches requests, manages the KV cache
3. Model weights on the GPU: loaded once, shared across requests
What does vLLM actually do?
vLLM takes a model’s weights, loads them onto the GPU (graphics processing
unit), and stands up an API (application programming interface) in front of them.
Your code sends a prompt to an HTTP (hypertext transfer protocol) endpoint and
gets tokens back. It is the same shape of call you would make to a hosted
service, except the model runs on hardware you control.
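As a rough sketch of that call, the Python below builds and sends a completion request using only the standard library. It assumes the server was started with its OpenAI-compatible API on localhost:8000 and serves the `/v1/completions` route. The model name `my-model` and the helper names are placeholders for illustration, not taken from any real deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to how you launched the server


def build_completion_request(prompt, model, max_tokens=64, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, model, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, model, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("Hello", model="my-model")` would return the generated text, where the model name must match one the server reports as loaded.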
The reason to use a server like this rather than calling the model directly is
what happens under load. vLLM batches many requests together so the GPU is not
left idle between them, and it manages the key-value (KV) cache, which holds the
attention keys and values computed for past tokens and grows with context
length, so memory is used efficiently. The result is more tokens served per
second when several callers hit the model at once.
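To see why the KV cache needs managing, it helps to estimate its size. The sketch below uses the standard arithmetic for a transformer with one key and one value vector per layer per KV head. The model shape (32 layers, 8 KV heads, head size 128, 16-bit values) is a hypothetical example, and the figures ignore how any particular server actually allocates cache memory.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(num_sequences, tokens_per_sequence, per_token_bytes):
    """Total KV cache in GiB for a batch of equally long sequences."""
    return num_sequences * tokens_per_sequence * per_token_bytes / 2**30


# Hypothetical model shape, chosen only to make the arithmetic concrete.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)                          # 131072 bytes, i.e. 128 KiB per token
print(kv_cache_gib(1, 4096, per_token))   # 0.5
print(kv_cache_gib(16, 4096, per_token))  # 8.0
```

Under these assumptions, one 4096-token sequence needs half a GiB, and sixteen concurrent sequences need 8 GiB on top of the weights, which is why cache memory limits how many callers can be batched together.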
How do you decide between serving engines?
Several engines do this job, and the right one depends on your model, your
hardware, and your traffic. vLLM fits a machine that serves one model to many
callers, and it exposes a widely compatible API, so existing client code is easy
to point at it. If you want a model running with no configuration at all, a
lighter runner is less work. Whichever engine you pick, the model still has to
fit in memory, and you will spend some time matching server flags to the
hardware in front of you.
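A quick way to check the "fit in memory" constraint is to multiply the parameter count by the bytes per parameter. The sketch below does that for a hypothetical 7-billion-parameter model stored in 16-bit precision. The `reserve_fraction` headroom for cache and activations is an illustrative guess, not a value from any engine's documentation.

```python
def weights_gib(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


def fits_on_gpu(num_params, bytes_per_param, gpu_gib, reserve_fraction=0.2):
    """True if the weights fit while leaving a fraction of memory free.

    reserve_fraction is a hypothetical allowance for the KV cache and
    activations; tune it to your own workload.
    """
    usable_gib = gpu_gib * (1 - reserve_fraction)
    return weights_gib(num_params, bytes_per_param) <= usable_gib


# Hypothetical example: 7 billion parameters at 2 bytes each (16-bit).
print(round(weights_gib(7_000_000_000, 2), 2))        # 13.04
print(fits_on_gpu(7_000_000_000, 2, gpu_gib=24))      # True
print(fits_on_gpu(7_000_000_000, 2, gpu_gib=16))      # False
```

The same model that fits comfortably on a 24 GiB card leaves too little headroom on a 16 GiB card under this assumption, which is the kind of check worth doing before comparing engines.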
Check it yourself
curl http://localhost:8000/v1/models
A running server reports the model it has loaded. Set the port to match how you launched it.
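The same check can be scripted. This minimal sketch assumes the response follows the OpenAI-style list shape, a JSON object whose `data` array holds one entry per model with an `id` field. The function names and the sample body are illustrative.

```python
import json
import urllib.request


def model_ids(models_response):
    """Extract the model ids from a parsed /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_models(base_url="http://localhost:8000"):
    """Fetch /v1/models from a running server and return the loaded model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as response:
        return model_ids(json.load(response))


# Illustrative response body, not captured from a real server.
sample = {"object": "list", "data": [{"id": "my-model", "object": "model"}]}
print(model_ids(sample))  # ['my-model']
```

Calling `list_models()` against a running server returns the ids it reports; pass a different `base_url` if you launched it on another port.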
Good fit when
You serve one model to many concurrent callers
You want an API compatible with common client libraries
Throughput under load is the metric you care about
You are willing to tune server flags for your hardware
Less of a fit when
You want a model running with no configuration
Your hardware is modest and needs a lighter runtime
You only ever send one request at a time
You want a tool that downloads and manages models for you