SGLang Restart OOM Fix: Unified Memory Cleanup on GB10/DGX Spark

March 21, 2026 5 min read

Quick Take

Restarting SGLang on GB10/DGX Spark kills itself with OOM even when RAM is free

Unified memory lingers for minutes after kill, and Docker’s cleanup is broken

The fix is a script that waits for memory to free, a 60-second systemd restart delay, and three critical SGLang flags

Docker’s Cleanup Lies

Last week a restart failed after I ran:

sudo docker run --rm --restart unless-stopped \
  --gpus all \
  --shm-size 16g \
  sglang/sglang-mistral-small-4:latest \
  --model-path /data/models/mistral-small-4 \
  --port 30000

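The container died with exit code 137 (SIGKILL, the usual signature of the OOM killer) within 30 seconds. free reported 24 GB free, but nvidia-smi still showed 90 GB allocated. The issue isn’t the model; it’s that Docker’s --rm and --restart flags do not play together. (Current Docker CLI releases refuse the pair outright with "Conflicting options: --restart and --rm", so you may hit that error before you ever hit the OOM.) A restart policy relaunches the container as soon as it dies, without waiting for the previous run’s memory to be reclaimed, so the next start collides with pages that are still held. Even if you see free RAM, the GPU side still accounts for those 90 GB until the OS finishes reclaiming unified memory, which can take minutes.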

Here’s what happens in practice:

# -d detaches, so the kill below can run from the same terminal
sudo docker run -d --rm --name sglang-test \
  --gpus all \
  sglang/sglang-mistral-small-4:latest \
  --model-path /data/models/mistral-small-4 \
  --port 30000

# Manual kill: SIGKILL
sudo docker kill sglang-test

# Free memory looks fine
free -h
#               total        used        free      shared  buff/cache   available
# Mem:           125G         35G         70G         2G         20G         88G

# But nvidia-smi still shows 90 GB allocated
nvidia-smi
# |===============================================|
# |   0  NVIDIA GB10  Off  | 00000000:01:00.0  90GB |

The container name is gone (docker ps -a shows nothing), yet the GPU memory is still tied up. Docker’s --rm removes the container, but it doesn’t wait for the OS to finish freeing the GPU’s unified memory. This is the first thing you need to know about unified memory on GB10/DGX Spark: the OS doesn’t release GPU memory instantly after a kill.

Why Unified Memory Doesn’t Free Instantly

Unified memory on NVIDIA GB10/DGX Spark uses a shared address space between CPU and GPU. When you kill a process, the OS marks the memory as free, but the GPU’s internal allocator still holds references until the driver’s cleanup thread runs. This cleanup can take 30 to 300 seconds, depending on how much memory was allocated and how aggressively the driver reclaims it.

Here’s how to reproduce it:

# Start a detached container that allocates about 90 GB once the model has loaded
sudo docker run -d --rm --name sglang-test \
  --gpus all \
  --shm-size 16g \
  sglang/sglang-mistral-small-4:latest \
  --model-path /data/models/mistral-small-4 \
  --port 30000

# Kill it once the model has finished loading
sudo docker kill sglang-test

# Check nvidia-smi every 10 seconds
watch -n 10 nvidia-smi

You’ll see the memory drop from 90 GB to 0 GB only after 2-5 minutes. If you restart the container during that window, Docker inherits the dirty state and immediately triggers OOM because the GPU allocator still thinks the memory is in use.

This is why --restart and --rm are the wrong tools here: Docker’s restart policy assumes the container can be killed and restarted cleanly, but unified memory breaks that assumption. An immediate restart races the driver’s cleanup, and the new container starts while the previous container’s memory footprint is still accounted for.
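To measure the reclaim window on your own machine instead of eyeballing watch, something like the following sketch may help. It assumes a Linux /proc/meminfo and reads MemAvailable, the same figure free reports in its available column. The function names mem_available_gb and log_reclaim are illustrative, not part of any tool.

```shell
#!/bin/bash
# Sketch: log how long memory takes to come back after a kill.
# Assumes Linux /proc/meminfo; function names are hypothetical.

# Print MemAvailable in whole GiB (rounded down) from a meminfo-style file.
mem_available_gb() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' "$meminfo"
}

# Poll until TARGET_GB is available (return 0) or TIMEOUT seconds pass (return 1).
# Prints one "elapsed_seconds available_gb" line per sample.
log_reclaim() {
  local target_gb="$1" timeout="${2:-300}" interval="${3:-10}"
  local meminfo="${4:-/proc/meminfo}"
  local start now avail
  start=$(date +%s)
  while true; do
    now=$(date +%s)
    avail=$(mem_available_gb "$meminfo")
    echo "$((now - start)) ${avail}"
    if [ "$avail" -ge "$target_gb" ]; then
      return 0
    fi
    if [ "$((now - start))" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

Source the file and call log_reclaim 70 right after docker kill. The last line printed shows roughly how many seconds the reclaim took, which is a more defensible basis for choosing a restart delay than a guess.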

The Working Startup Script

The fix is twofold: wait for memory to free, and stop using --restart. Here’s the script I now use (/data/scripts/sglang-wait-memory.sh):

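#!/bin/bash
# Block until enough memory is available, or give up after a timeout.
set -euo pipefail

MIN_FREE_GB=70          # required available memory, in GiB
MAX_WAIT_SECONDS=300    # give up after 5 minutes
POLL_SECONDS=10         # time between checks

echo "Waiting for at least ${MIN_FREE_GB} GB of available RAM..."

start_time=$(date +%s)
while true; do
  # Column 7 of the "Mem:" row is "available", not "free"
  free_gb=$(free -g | awk '/^Mem:/ {print $7}')
  if [ "${free_gb:-0}" -ge "$MIN_FREE_GB" ]; then
    echo "Memory available: ${free_gb} GB"
    break
  fi
  current_time=$(date +%s)
  elapsed=$((current_time - start_time))
  if [ "$elapsed" -ge "$MAX_WAIT_SECONDS" ]; then
    echo "Timeout after ${MAX_WAIT_SECONDS} seconds (only ${free_gb:-0} GB available)" >&2
    exit 1
  fi
  sleep "$POLL_SECONDS"
done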

This script runs before starting SGLang. It polls every 10 seconds until at least 70 GB of RAM is available (the available column of free, which counts reclaimable cache), or exits with an error after 5 minutes. You call it like this in sglang-fg.sh:

#!/bin/bash
set -e

sudo bash /data/scripts/sglang-wait-memory.sh

sudo docker rm sglang-mistral4 2>/dev/null || true
sudo docker run --rm --name sglang-mistral4 \
  --gpus all \
  --shm-size 16g \
  --ulimit memlock=-1 \
  sglang/sglang-mistral-small-4:latest \
  --model-path /data/models/mistral-small-4 \
  --port 30000 \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm EAGLE \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.75 \
  --context-length 65536

Key flags:

--attention-backend triton: Uses the Triton attention backend, which in my runs is lighter on memory than FlashInfer across restarts.
--mem-fraction-static 0.75: Caps SGLang’s static allocation (model weights plus the KV cache pool) at 75% of memory, leaving 25% headroom for activations, the OS, and other processes.
--context-length 65536: Keeps a 64k context window without blowing up memory during startup.
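To see what 0.75 means in gigabytes, here is a rough sketch of the arithmetic. It assumes the fraction applies to the 125 GB total that free reported earlier, which is a simplification: SGLang computes its budget against the memory the GPU reports. The helper static_budget is hypothetical and not part of SGLang.

```shell
#!/bin/bash
# Sketch: rough static-allocation budget for --mem-fraction-static.
# Assumption: the fraction is applied to the total passed in (GB).

# Print "static_gb headroom_gb" for a total size in GB and a fraction.
static_budget() {
  awk -v total="$1" -v frac="$2" \
    'BEGIN { s = total * frac; printf "%.2f %.2f\n", s, total - s }'
}

static_budget 125 0.75   # prints: 93.75 31.25
```

On those assumptions, roughly 94 GB goes to weights and KV cache, and about 31 GB is left for everything else that shares the unified pool.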

Do not use --restart with --rm. Ever. If you need auto-restart, use systemd instead.

systemd Service That Doesn’t Lie

Here’s the systemd service I run (/etc/systemd/system/sglang-mistral4.service):

[Unit]
Description=SGLang Mistral Small 4
After=network.target docker.service
Wants=docker.service
StartLimitBurst=3
StartLimitIntervalSec=600

[Service]
Type=simple
User=root
ExecStart=/data/scripts/sglang-fg.sh
Restart=on-failure
RestartSec=60
TimeoutStartSec=600
Environment=NVIDIA_VISIBLE_DEVICES=all
Environment=NVIDIA_DRIVER_CAPABILITIES=compute,utility

[Install]
WantedBy=multi-user.target

Why these settings:

RestartSec=60: Gives the OS 60 seconds to finish unified memory cleanup after a kill.
StartLimitBurst=3 and StartLimitIntervalSec=600: Prevents OOM restart loops. If the service fails three times in 10 minutes, it stops restarting.
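TimeoutStartSec=600: SGLang’s startup can take 5-8 minutes on GB10, and the default of 90 seconds isn’t enough. With Type=simple, systemd treats the unit as started as soon as the script is launched, so this limit mainly matters if you move the memory wait into an ExecStartPre= step.

Do not add After=nvidia-gpu-reset.target or Wants=nvidia-gpu-reset.target. That target doesn’t exist on DGX Spark/GB10, so systemd ignores the ordering and the line gives you no delay at all.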
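One detail that is easy to get wrong: systemd documents StartLimitBurst= and StartLimitIntervalSec= as [Unit] options, and a StartLimitIntervalSec= line under [Service] is ignored with a warning. The sketch below is a small check for which section a key lives in. The helper unit_key_section is hypothetical, and systemd-analyze verify is the real tool for a full check.

```shell
#!/bin/bash
# Sketch: report which [Section] of a unit file contains a given key.
# unit_key_section is an illustrative helper, not a systemd command.

unit_key_section() {
  local unit_file="$1" key="$2"
  awk -v key="$key" '
    /^\[.*\]$/ { section = $0; next }          # remember the current [Section]
    index($0, key "=") == 1 { print section }  # line starts with "Key="
  ' "$unit_file"
}
```

For example, unit_key_section /etc/systemd/system/sglang-mistral4.service StartLimitIntervalSec should print [Unit]. If it prints [Service], the rate limit is not being applied and the OOM restart loop protection is not in effect.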

What I Actually Use

Mistral Small 4: The only model I’ve found that fits in 90 GB with a 64k context window without OOMing on restart

sglang-wait-memory.sh (called from sglang-fg.sh): Waits up to 5 minutes for memory to free, which prevents restart races with unified memory

systemd service with RestartSec=60: The only way I’ve found to run SGLang without Docker lying about memory

Flow: SGLang OOM fix (unified memory cleanup on GB10/DGX Spark)

Problem: Docker cleanup fails after kill
Diagnosis: Unified memory lingers 2-5 minutes
Fix: Memory wait script + 60-second restart delay + flags
Result: Prevents OOM on restart
