SGLang Restart OOM Fix: Unified Memory Pitfalls on ARM64 GPUs
Quick Take
- SGLang on DGX Spark/GB10 dies with SIGKILL after restart even when RAM is free
- Docker’s --rm and --restart fight each other like cats in a bag
- Unified memory needs 60+ seconds to fully release after docker kill
I watched a 90GB Mistral Small 4 container crash three times in a row after a simple restart. Free RAM? Plenty. GPU memory? Cleaned up. Yet Docker killed it with SIGKILL every time. Turns out unified memory on ARM64 GPUs plays by different rules than classic VRAM + RAM. Here’s what broke, why it broke, and the exact commands that fixed it.
Why SGLang Dies After Restart Despite Free RAM
The first restart looked innocent:
docker restart sglang-mistral4
Exit code 137. SIGKILL. No logs. Just gone.
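Exit statuses above 128 encode a signal number: 137 is 128 + 9, and signal 9 is SIGKILL. A minimal sketch of that decoding follows; the helper name signal_from_exit_code is made up for illustration, and the docker inspect lines in the comments only work if the container still exists (so not with --rm).

```shell
#!/bin/bash
# Hypothetical helper: map an exit status above 128 to the signal name.
signal_from_exit_code() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    # 137 - 128 = 9, and `kill -l 9` prints the name of signal 9
    kill -l $((code - 128))
  else
    echo "none"
  fi
}

# Usage, assuming the stopped container is still around to inspect:
#   code=$(docker inspect --format '{{.State.ExitCode}}' sglang-mistral4)
#   oom=$(docker inspect --format '{{.State.OOMKilled}}' sglang-mistral4)
#   echo "exit=$code signal=$(signal_from_exit_code "$code") oom_killed=$oom"
```

Checking State.OOMKilled alongside the exit code helps separate a cgroup OOM kill from some other source of SIGKILL.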
Free memory looked fine, with 32GB free:
free -h
#        total  used  free  shared  buff/cache  available
# Mem:   128G   96G   32G   2G      0G          28G
But the GPU still held ~90GB of unified memory. Why?
DGX Spark and GB10 use unified memory architectures. When you docker kill a container, the OS marks the memory as free but doesn’t actually release it to the pool for seconds to minutes. Docker’s --rm flag tells the runtime to clean up, but it races against the OS’s memory release. Add --restart unless-stopped and you’ve got a guaranteed OOM cascade.
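To see the release lag for yourself, it helps to log available memory every few seconds right after the kill. The sketch below reads MemAvailable from /proc/meminfo, which is the same figure free reports in its "available" column. The function names and the default of 24 samples at 5-second intervals are my own choices, not anything SGLang or Docker provides.

```shell
#!/bin/bash
# Print MemAvailable in whole GB (GiB, rounded down).
# The optional argument is a meminfo-format file, handy for testing.
mem_available_gb() {
  local meminfo=${1:-/proc/meminfo}
  awk '/^MemAvailable:/ {printf "%d\n", $2 / 1048576}' "$meminfo"
}

# Log available memory a fixed number of times.
# Defaults: 24 samples, 5 seconds apart (about two minutes).
log_mem_release() {
  local samples=${1:-24} interval=${2:-5} i
  for i in $(seq 1 "$samples"); do
    echo "$(date +%T) available=$(mem_available_gb)GB"
    sleep "$interval"
  done
}

# Usage:
#   docker kill sglang-mistral4 && log_mem_release
```

If the number keeps climbing for a minute or more after the kill, that is the lag described above.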
The Three Failed Fixes Before the Real Solution
I tried everything. Every flag change made things worse.
First, --attention-backend flashinfer seemed promising:
# SGLang flags go after the image name; anything before it is parsed by docker run
docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --model-path /models/mistral-small-4 \
  --attention-backend flashinfer
But flashinfer’s initialization allocates an extra 2GB on top of the model. With 90GB already in use, that pushed total memory over the edge after restart. OOM city.
Next, --mem-fraction-static 0.88:
docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --mem-fraction-static 0.88
Too tight. The container started fine but crashed during warmup when memory pressure spiked. 0.88 left no headroom for the OS’s unified memory lag.
Then I tried speculative decoding flags:
docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --speculative-eagle-topk 4
But SGLANG_ENABLE_SPEC_V2=True only supports topk=1. The mismatch caused silent failures and memory leaks. The container would run for a few minutes then die.
Finally, --restart unless-stopped with --rm:
docker run --rm --restart unless-stopped --gpus all \
sglang/sglang:v0.3.0-mistral-small-4
Docker’s documentation says these flags are incompatible. The runtime tries to clean up while also planning to restart, leaving memory in limbo. The result? SIGKILL every time.
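A cheap guard is to check a docker run argument list for this combination before launching anything. This is only a sketch: has_rm_restart_conflict is a hypothetical helper, and it looks at literal arguments rather than asking Docker.

```shell
#!/bin/bash
# Hypothetical pre-flight check: succeeds (exit 0) when the argument list
# contains both --rm and a --restart policy.
has_rm_restart_conflict() {
  local has_rm=0 has_restart=0 arg
  for arg in "$@"; do
    case "$arg" in
      --rm) has_rm=1 ;;
      --restart|--restart=*) has_restart=1 ;;
    esac
  done
  [ "$has_rm" -eq 1 ] && [ "$has_restart" -eq 1 ]
}

# Usage:
#   if has_rm_restart_conflict "$@"; then
#     echo "pick one: --rm or --restart" >&2
#     exit 1
#   fi
```

The point is to let exactly one thing own restarts. In my setup that ended up being systemd, with the container itself started with --rm and no Docker restart policy.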
The Working SGLang Command Line
After all the failures, this configuration works:
# SGLang flags go after the image name; anything before it is parsed by docker run
docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm EAGLE \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.75 \
  --context-length 65536
Key choices:
- Triton backend avoids flashinfer’s extra 2GB allocation
- flashinfer_cutlass for MoE layers keeps memory lean
- EAGLE with topk=1 matches the spec v2 requirement
- mem-fraction-static 0.75 leaves 25% headroom for the OS’s unified memory lag
- context-length 65536 fits Mistral Small 4’s needs without overcommitting
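The headroom difference between 0.88 and 0.75 is easier to judge in GB. The sketch below assumes the fraction applies to the full 128GB pool that free reports; SGLang's real accounting may differ, so treat the figures as a rough guide. static_budget_gb is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: given total memory in GB and a mem-fraction-static
# value, print "<static budget> <headroom>" in GB with one decimal place.
# Assumes the fraction is taken from the whole unified pool.
static_budget_gb() {
  awk -v total="$1" -v frac="$2" \
    'BEGIN {printf "%.1f %.1f\n", total * frac, total * (1 - frac)}'
}

# Usage:
#   static_budget_gb 128 0.88
#   static_budget_gb 128 0.75
```

Under that assumption, 0.88 leaves about 15GB for everything else on the box, while 0.75 leaves 32GB.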
Note: This exact command assumes you’ve pre-downloaded the model to /models/mistral-small-4. Adjust paths accordingly.
Adding Memory Wait Before Restart
The real fix wasn’t just the command line. It was waiting for memory to fully release.
I added a wait script at /data/scripts/sglang-wait-memory.sh:
#!/bin/bash
# Block until "available" memory reaches the target, or give up after a timeout.
TARGET_FREE=${1:-70}  # GB
MAX_WAIT=${2:-300}    # seconds
start=$(date +%s)
while true; do
  # Column 7 of the Mem: row in `free -g` is "available"
  free_gb=$(free -g | awk '/^Mem:/ {print $7}')
  if [ "$free_gb" -ge "$TARGET_FREE" ]; then
    echo "Memory available: ${free_gb}GB"
    break
  fi
  elapsed=$(( $(date +%s) - start ))
  if [ "$elapsed" -ge "$MAX_WAIT" ]; then
    echo "Timeout waiting for memory"
    exit 1
  fi
  sleep 5
done
Then modified the service to call it:
ExecStartPre=/data/scripts/sglang-wait-memory.sh 70 300
ExecStart=/usr/bin/docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm EAGLE \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.75 \
  --context-length 65536
The systemd service now waits up to 5 minutes for memory to stabilize before starting the container.
Watch Out: These Landmines Will Bite You
Gotcha: nvidia-gpu-reset.target doesn’t exist on DGX Spark/GB10. I tried adding:
After=nvidia-gpu-reset.target
to the service file. Systemd ignored it. No error, just no effect. The target simply isn’t present on ARM64 GPUs. Remove it before you waste hours debugging.
Warning: Don’t set RestartSec too low. I started with 30 seconds:
RestartSec=30
But unified memory on ARM64 can take 60+ seconds to fully release. With 30 seconds, the container would start before memory was ready, triggering OOM again. Bump it to 60 seconds minimum.
Note: StartLimitBurst=3 with StartLimitIntervalSec=600 prevents restart loops. Without it, systemd would retry the service after every RestartSec interval, each attempt failing and eating memory until the system collapsed. Set both explicitly in the [Unit] section:
StartLimitBurst=3
StartLimitIntervalSec=600
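These restart settings can live in one systemd drop-in so the main unit file stays untouched. The sketch below only prints the drop-in text. The unit name sglang.service and the Restart=on-failure line are assumptions on my part, so adjust both to match your service.

```shell
#!/bin/bash
# Print a drop-in with the restart settings discussed above.
# Arguments (all optional): RestartSec, StartLimitBurst, StartLimitIntervalSec.
render_restart_dropin() {
  local restart_sec=${1:-60} burst=${2:-3} interval=${3:-600}
  cat <<EOF
[Unit]
StartLimitBurst=${burst}
StartLimitIntervalSec=${interval}

[Service]
# Assumption: restart only on failure; the original unit may use another policy.
Restart=on-failure
RestartSec=${restart_sec}
EOF
}

# Usage (sglang.service is a placeholder unit name):
#   sudo mkdir -p /etc/systemd/system/sglang.service.d
#   render_restart_dropin | sudo tee /etc/systemd/system/sglang.service.d/restart.conf
#   sudo systemctl daemon-reload
```

Note that the two StartLimit settings belong in [Unit] while RestartSec belongs in [Service].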
The Only Reliable Restart Sequence
Never restart SGLang immediately after killing it. Always follow this sequence:
docker kill sglang-mistral4
# Wait for memory to free
/data/scripts/sglang-wait-memory.sh 70 300
# Then start
docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm EAGLE \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.75 \
  --context-length 65536
Automate it with a wrapper script if you restart often. Manual restarts are the fastest way to hit OOM.
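One possible shape for that wrapper is sketched below. It kills the container, runs the wait script, and only then runs whatever start command you hand it. The function name, the environment variable names, and the start script path in the usage comment are all hypothetical. The wait script path and the 70/300 defaults are the ones used earlier.

```shell
#!/bin/bash
# Hypothetical wrapper: kill, wait for memory, then run the start command.
DOCKER=${DOCKER:-docker}
WAIT_SCRIPT=${WAIT_SCRIPT:-/data/scripts/sglang-wait-memory.sh}
TARGET_FREE_GB=${TARGET_FREE_GB:-70}
MAX_WAIT_S=${MAX_WAIT_S:-300}

restart_sglang() {
  local name=$1
  shift
  # A container that is already gone is not an error here.
  "$DOCKER" kill "$name" >/dev/null 2>&1 \
    || echo "no running container named $name" >&2
  # Refuse to start if memory never comes back.
  if ! "$WAIT_SCRIPT" "$TARGET_FREE_GB" "$MAX_WAIT_S"; then
    echo "memory not released, refusing to start $name" >&2
    return 1
  fi
  # Everything after the container name is the start command, run as given.
  "$@"
}

# Usage (the start script path is a placeholder for your own launch command):
#   restart_sglang sglang-mistral4 /data/scripts/sglang-start.sh
```

Passing the start command as arguments keeps the long docker run line in one place instead of copying it into yet another script.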
What I Actually Use
- Mistral Small 4: The only model that fits in unified memory without constant crashes
- DGX Spark with GB10: ARM64 GPU that needs patience for memory release
- Triton backend: Avoids the extra 2GB that flashinfer allocates at initialization