vLLM: a server for running large language models

vLLM is an open-source inference server for large language models. It loads a model and exposes an API, then batches incoming requests and manages the key-value (KV) cache so a single GPU serves many callers efficiently. It is one of the serving layers you can put in front of a model on your own hardware.

What does vLLM actually do?

vLLM takes a model’s weights, loads them onto the GPU (graphics processing unit), and stands up an API (application programming interface) in front of them. Your code sends a prompt to an HTTP (hypertext transfer protocol) endpoint and gets tokens back. The call has the same shape as one you would make to a hosted service, except the model runs on hardware you control.
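As a rough illustration of that call, here is a minimal sketch using only the Python standard library. It assumes the server is listening on `http://localhost:8000` and exposes an OpenAI-style `/v1/completions` route; the base URL, the model name `my-model`, and the helper names `build_completion_request` and `complete` are placeholders for illustration, not values taken from this page.

```python
import json
import urllib.request

# Assumed values for illustration; adjust them to your own deployment.
BASE_URL = "http://localhost:8000"
MODEL_NAME = "my-model"  # hypothetical model name


def build_completion_request(prompt, max_tokens=64, base_url=BASE_URL, model=MODEL_NAME):
    """Build (but do not send) a POST request asking for a text completion."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage, with a server running at BASE_URL:
# text = complete("Explain what a KV cache is in one sentence.")
```

Building the request separately from sending it keeps the payload easy to inspect before any network traffic happens.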

The reason to use a server like this rather than calling the model directly is what happens under load. vLLM batches many requests together so the GPU is not left idle between them, and it manages the key-value (KV) cache, the store of attention keys and values computed for past tokens, which grows with context length, so memory is used efficiently. The result is more tokens served per second when several callers hit the model at once.
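To see why the KV cache matters, a back-of-envelope estimate helps. The sketch below uses the usual rule of thumb for a transformer (one key vector and one value vector per layer and per KV head for every token); the model dimensions are hypothetical, and this is not vLLM's own accounting.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache one token occupies (2 bytes per element = 16-bit values)."""
    # Factor of 2: one key vector and one value vector per layer and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gib(num_tokens, **model_dims):
    """Total KV cache size in GiB for num_tokens cached tokens."""
    return num_tokens * kv_cache_bytes_per_token(**model_dims) / 2**30


# Hypothetical model dimensions, chosen only to make the arithmetic concrete.
HYPOTHETICAL_MODEL = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes_per_token(**HYPOTHETICAL_MODEL)    # 524288 bytes (0.5 MiB)
one_caller = kv_cache_gib(4096, **HYPOTHETICAL_MODEL)         # 2.0 GiB for one 4096-token context
eight_callers = kv_cache_gib(8 * 4096, **HYPOTHETICAL_MODEL)  # 16.0 GiB for eight such contexts
```

Under these assumptions, eight concurrent 4096-token contexts need 16 GiB of cache on top of the weights, which is why careful cache management determines how many callers one GPU can serve at once.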

How do you decide between serving engines?

Several engines do this job, and the right one depends on your model, your hardware, and your traffic. vLLM fits a box that serves one model to many callers, and its widely compatible API makes it easy to point existing client code at it. If you want a model running with no configuration at all, a lighter runner is less work. Whichever engine you pick, the model still has to fit in memory, and you will spend some time matching server flags to the hardware in front of you.
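The "has to fit in memory" constraint can be checked roughly before you download anything. This is a minimal sketch, assuming 2 bytes per parameter (16-bit weights) and an arbitrary 20% of GPU memory held back for the KV cache and activations; both numbers and both function names are illustrative assumptions, not recommendations from any engine.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


def fits_on_gpu(num_params, gpu_memory_gib, bytes_per_param=2, reserve_fraction=0.2):
    """True if the weights fit while reserve_fraction of the GPU stays free.

    reserve_fraction is an assumed headroom for KV cache and activations.
    """
    usable_gib = gpu_memory_gib * (1 - reserve_fraction)
    return weight_memory_gib(num_params, bytes_per_param) <= usable_gib


# A hypothetical 7-billion-parameter model needs about 13 GiB for 16-bit weights.
seven_b = 7_000_000_000
fits_24 = fits_on_gpu(seven_b, gpu_memory_gib=24)  # True: about 13 GiB vs 19.2 GiB usable
fits_16 = fits_on_gpu(seven_b, gpu_memory_gib=16)  # False: about 13 GiB vs 12.8 GiB usable
```

A check like this only rules out clearly impossible combinations; real headroom depends on context length, concurrency, and the engine's own settings.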

