SGLang is an open-source inference and serving engine for large language models. It loads a model and exposes an API, then schedules incoming requests, reuses shared prefixes between them, and manages the key-value (KV) cache so the GPU stays efficient under concurrent load. It is one of the serving layers you can put in front of a model on your own hardware.
At a glance
What it is
A serving engine that runs a model behind an API
Why use it
It schedules requests and reuses shared prefixes for throughput
What it serves
An HTTP endpoint your applications call
Where it runs
On your own GPU, the weights stay local
Flow
The serving engine in the middle
Requests arrive, the engine schedules them together and reuses any shared prefix, then runs the model and returns tokens. The weights stay on your hardware.
1
Application or agentsends prompts over HTTP
2
SGLang (the serving engine)schedules requests, reuses prefixes, manages the KV cache
3
Model weights on the GPUloaded once, shared across requests
What is SGLang for?
SGLang takes a model’s weights, loads them onto the GPU (graphics processing
unit), and stands up an API (application programming interface) in front of
them. Your code sends a prompt to an HTTP (hypertext transfer protocol)
endpoint and gets tokens back, the same shape of call you would make to a hosted
service, except the model runs on hardware you control.
Like other serving engines, the value is in how it handles many requests at
once. It schedules incoming work so the GPU is not left idle, and it can reuse a
shared prefix across requests instead of recomputing it, which helps when many
calls start with the same long instruction. The key-value (KV) cache, the store
of past tokens, is managed for you rather than left to grow blindly.
How do you decide between serving engines?
Several engines do this job, and the honest answer is that the right one depends
on your model, your hardware, and your traffic. SGLang fits a box that serves one
model to many callers and rewards you for tuning its flags. If you want a model
running with no configuration at all, a lighter runner is less work. Whichever
engine you pick, the model still has to fit in memory, and you will spend some
time matching server flags to the hardware in front of you.
Check it yourself
curl http://localhost:30000/v1/models
A running server reports the model it has loaded. Set the port to match how you launched it.
Good fit when
You serve one model to many concurrent callers
Requests share long common prefixes you want reused
Throughput under load is the metric you care about
You are willing to tune server flags for your hardware
Less of a fit when
You want a model running with no configuration
Your hardware is modest and needs a lighter runtime
You only ever send one request at a time
You want a tool that downloads and manages models for you