
vLLM: a server for running large language models

vLLM is an open-source inference server for large language models. It loads a model and exposes an API, then batches incoming requests and manages the key-value (KV) cache so a single GPU serves many callers efficiently. It is one of the serving layers you can put in front of a model on your own hardware.

At a glance

What it is: An inference server that runs a model behind an API
Why use it: It batches concurrent requests to keep the GPU busy
What it serves: An HTTP endpoint your applications call
Where it runs: On your own GPU, so the weights stay local
Flow

The inference server in the middle

Requests arrive, the server batches them together and manages the cache, then runs the model and streams tokens back. The weights stay on your hardware.

1. Application or agent: sends prompts over HTTP
2. vLLM (the inference server): batches requests, manages the KV cache
3. Model weights on the GPU: loaded once, shared across requests
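
The middle step is where the throughput comes from. The sketch below is a toy model of batching, not vLLM's scheduler. It assumes each request needs a known number of output tokens, and that one model step costs the same however many requests share it, up to a batch limit. The function names and the `max_batch` parameter are made up for illustration.

```python
from collections import deque


def steps_sequential(output_lengths):
    """Model steps needed when requests run one after another."""
    return sum(output_lengths)


def steps_batched(output_lengths, max_batch):
    """Model steps needed when up to max_batch requests share each step.

    A finished request leaves the batch and a waiting one takes its slot.
    Assumes every output length is at least 1.
    """
    waiting = deque(output_lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        # Fill free slots from the queue before running the next step.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One step yields one token per active request; drop finished ones.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps


lengths = [3, 5, 2, 4]  # hypothetical output lengths, in tokens
print(steps_sequential(lengths))            # 14
print(steps_batched(lengths, max_batch=2))  # 9
```

In this toy model the same four requests finish in 9 steps instead of 14, because a slot freed by a short request is reused at once. Real step times do grow with batch size, so treat the numbers as an illustration of the idea only.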

What does vLLM actually do?

vLLM takes a model’s weights, loads them onto the GPU (graphics processing unit), and stands up an API (application programming interface) in front of them. Your code sends a prompt to an HTTP (hypertext transfer protocol) endpoint and gets tokens back. It is the same shape of call you would make to a hosted service, except that the model runs on hardware you control.
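
As a rough sketch of that call, the Python below builds a completion request using only the standard library. It assumes the server listens on `http://localhost:8000` and serves an OpenAI-style `/v1/completions` route. The model name `my-model` is a placeholder, and both helper functions are hypothetical names for this example.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build, but do not send, a POST request for a text completion."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, max_tokens=64):
    """Send the request and return the text of the first choice."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# With a server running (assumed address and placeholder model name):
# print(complete("http://localhost:8000", "my-model", "Hello"))
```

Splitting request construction from sending keeps the payload easy to inspect before anything goes over the network.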

The reason to use a server like this rather than calling the model directly is what happens under load. vLLM batches many requests together so the GPU is not left idle between them. It also manages the key-value (KV) cache, which holds the attention keys and values already computed for earlier tokens and grows with context length, so that memory is used efficiently. The result is more tokens served per second when several callers hit the model at once.
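
To see why the cache needs managing, it helps to estimate its size. The sketch below uses the usual back-of-the-envelope count: one key vector and one value vector per layer and per KV head, for every token held. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a made-up example, not the configuration of any particular model.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: each layer stores both a key and a value per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(context_tokens, concurrent_requests, per_token_bytes):
    """Total KV cache if every request holds context_tokens tokens."""
    return context_tokens * concurrent_requests * per_token_bytes


# Hypothetical model shape, chosen only to make the arithmetic concrete.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
total = kv_cache_bytes(context_tokens=4096, concurrent_requests=16,
                       per_token_bytes=per_token)

print(per_token // 1024, "KiB per token")  # 128 KiB per token
print(total / 1024**3, "GiB in total")     # 8.0 GiB in total
```

Under these assumptions, sixteen callers with 4,096-token contexts need about 8 GiB for the cache alone, which is why the server has to allocate it carefully.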

How do you decide between serving engines?

Several engines do this job, and the right one depends on your model, your hardware, and your traffic. vLLM suits a machine that serves one model to many callers. It exposes an OpenAI-compatible API, so existing client code can usually be pointed at it with little change. If you want a model running with no configuration at all, a lighter runner is less work. Whichever engine you pick, the model still has to fit in memory, and you will spend some time matching server flags to the hardware in front of you.
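
A quick way to check the "fits in memory" condition is to subtract the weights from the memory you let the server use and see what is left for the KV cache. The sketch below is a simplification that ignores activations and runtime overhead. The function names and the `usable_fraction` parameter are hypothetical, and the KV cost per token must be supplied by you.

```python
def weight_bytes(num_params, bytes_per_param):
    """Memory taken by the model weights alone."""
    return num_params * bytes_per_param


def kv_token_budget(gpu_bytes, num_params, bytes_per_param,
                    kv_bytes_per_token, usable_fraction=0.9):
    """Tokens of KV cache that fit after the weights are loaded.

    Returns 0 when the weights alone exceed the usable memory.
    """
    usable = int(gpu_bytes * usable_fraction)
    free = usable - weight_bytes(num_params, bytes_per_param)
    if free <= 0:
        return 0
    return free // kv_bytes_per_token


# Hypothetical setup: 24 GiB GPU, 7-billion-parameter model in 2-byte
# precision, 128 KiB of KV cache per token.
budget = kv_token_budget(
    gpu_bytes=24 * 1024**3,
    num_params=7_000_000_000,
    bytes_per_param=2,
    kv_bytes_per_token=128 * 1024,
)
print(budget, "tokens of KV cache across all concurrent requests")
```

The budget is shared by every request in flight, so dividing it by your typical context length gives a first guess at how many callers the machine can hold at once.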

Check it yourself

curl http://localhost:8000/v1/models

A running server reports the model it has loaded. Set the port to match how you launched it.
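
If you would rather read the reply in code than by eye, the sketch below pulls the model names out of the response body. It assumes the reply follows the OpenAI-style list format, with a `data` array whose entries carry an `id`. The sample body and the name `example-model` are invented for illustration.

```python
import json


def loaded_model_ids(response_body):
    """Return the model ids listed in a /v1/models response body."""
    listing = json.loads(response_body)
    return [entry["id"] for entry in listing.get("data", [])]


# Invented sample body in the assumed format, not real server output.
sample = '{"object": "list", "data": [{"id": "example-model", "object": "model"}]}'
print(loaded_model_ids(sample))  # ['example-model']
```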

Good fit when

  • You serve one model to many concurrent callers
  • You want an API compatible with common client libraries
  • Throughput under load is the metric you care about
  • You are willing to tune server flags for your hardware

Less of a fit when

  • You want a model running with no configuration
  • Your hardware is modest and needs a lighter runtime
  • You only ever send one request at a time
  • You want a tool that downloads and manages models for you

Reviewed: June 2026