Learn

SGLang: an engine for serving language models

SGLang is an open-source inference and serving engine for large language models. It loads a model and exposes an API, then schedules incoming requests, reuses shared prefixes between them, and manages the key-value (KV) cache so the GPU stays efficient under concurrent load. It is one of the serving layers you can put in front of a model on your own hardware.

At a glance

What it is
A serving engine that runs a model behind an API
Why use it
It schedules requests and reuses shared prefixes for throughput
What it serves
An HTTP endpoint your applications call
Where it runs
On your own GPU, the weights stay local
Flow

The serving engine in the middle

Requests arrive, the engine schedules them together and reuses any shared prefix, then runs the model and returns tokens. The weights stay on your hardware.

1
Application or agent sends prompts over HTTP
2
SGLang (the serving engine) schedules requests, reuses prefixes, manages the KV cache
3
Model weights on the GPU loaded once, shared across requests

What is SGLang for?

SGLang takes a model’s weights, loads them onto the GPU (graphics processing unit), and stands up an API (application programming interface) in front of them. Your code sends a prompt to an HTTP (hypertext transfer protocol) endpoint and gets tokens back, the same shape of call you would make to a hosted service, except the model runs on hardware you control.

Like other serving engines, the value is in how it handles many requests at once. It schedules incoming work so the GPU is not left idle, and it can reuse a shared prefix across requests instead of recomputing it, which helps when many calls start with the same long instruction. The key-value (KV) cache, the store of past tokens, is managed for you rather than left to grow blindly.

How do you decide between serving engines?

Several engines do this job, and the honest answer is that the right one depends on your model, your hardware, and your traffic. SGLang fits a box that serves one model to many callers and rewards you for tuning its flags. If you want a model running with no configuration at all, a lighter runner is less work. Whichever engine you pick, the model still has to fit in memory, and you will spend some time matching server flags to the hardware in front of you.

Check it yourself

curl http://localhost:30000/v1/models

A running server reports the model it has loaded. Set the port to match how you launched it.

Good fit when

  • You serve one model to many concurrent callers
  • Requests share long common prefixes you want reused
  • Throughput under load is the metric you care about
  • You are willing to tune server flags for your hardware

Less of a fit when

  • You want a model running with no configuration
  • Your hardware is modest and needs a lighter runtime
  • You only ever send one request at a time
  • You want a tool that downloads and manages models for you

Related terms

← All terms Reviewed: June 2026