How a 90GB model kept dying after restart despite free RAM, and the exact commands that finally fixed it

SGLang Restart OOM Fix: Unified Memory Pitfalls on ARM64 GPUs

Quick Take

  • SGLang on DGX Spark/GB10 dies with SIGKILL after restart even when RAM is free
  • Docker’s --rm and --restart fight each other like cats in a bag
  • Unified memory needs 60+ seconds to fully release after docker kill

I watched a 90GB Mistral Small 4 container crash three times in a row after a simple restart. Free RAM? Plenty. GPU memory? Cleaned up. Yet Docker killed it with SIGKILL every time. Turns out unified memory on ARM64 GPUs plays by different rules than classic VRAM + RAM. Here’s what broke, why it broke, and the exact commands that fixed it.


Why SGLang Dies After Restart Despite Free RAM

The first restart looked innocent:

docker restart sglang-mistral4

Exit code 137. SIGKILL. No logs. Just gone.
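Exit code 137 is not arbitrary: by shell convention, codes above 128 mean the process was terminated by signal (code minus 128), and 137 minus 128 is 9, which is SIGKILL. A minimal sketch of that decoding follows; the helper name describe_exit_code is made up for illustration, and the docker inspect lines in the comments assume the container still exists (it will not if it was started with --rm).

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate an exit code into something readable.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
describe_exit_code() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    local sig=$(( code - 128 ))
    # kill -l N prints the name of signal number N (9 -> KILL)
    echo "killed by signal ${sig} (SIG$(kill -l "$sig"))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error (exit ${code})"
  fi
}

# Usage, assuming the container was not removed on exit:
#   code=$(docker inspect --format '{{.State.ExitCode}}' sglang-mistral4)
#   describe_exit_code "$code"
#   docker inspect --format '{{.State.OOMKilled}}' sglang-mistral4
```

The OOMKilled field is worth checking alongside the exit code, since SIGKILL on its own does not say who sent it.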

free showed 32GB free and 28GB available:

free -h
#               total        used        free      shared  buff/cache   available
# Mem:           128G         96G         32G         2G          0G         28G

But the GPU still held ~90GB of unified memory. Why?

DGX Spark and GB10 use unified memory architectures. When you docker kill a container, the OS marks the memory as free but doesn’t actually release it to the pool for seconds to minutes. Docker’s --rm flag tells the runtime to clean up, but it races against the OS’s memory release. Add --restart unless-stopped and you’ve got a guaranteed OOM cascade.


The Four Failed Fixes Before the Real Solution

I tried everything. Every flag change made things worse.

First, --attention-backend flashinfer seemed promising:

# SGLang server flags go after the image name; docker run rejects unknown flags placed before it
docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --model-path /models/mistral-small-4 \
  --attention-backend flashinfer

But flashinfer’s initialization allocates an extra 2GB on top of the model. With 90GB already in use, that pushed total memory over the edge after restart. OOM city.

Next, --mem-fraction-static 0.88:

docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --mem-fraction-static 0.88

Too tight. The container started fine but crashed during warmup when memory pressure spiked. 0.88 left no headroom for the OS’s unified memory lag.
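To see why 0.88 is tight, it helps to put numbers on it. The sketch below is deliberately simplified: it assumes --mem-fraction-static applies to the whole 128GB pool that free reports, which may not match how SGLang actually accounts for memory on this hardware. The function name static_budget is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: static budget = total * fraction, headroom = the rest.
# Assumption: the fraction is taken against the full unified memory pool.
static_budget() {
  local total_gb="$1" fraction="$2"
  awk -v t="$total_gb" -v f="$fraction" \
    'BEGIN { printf "static=%.1fGB headroom=%.1fGB\n", t * f, t - t * f }'
}

# Usage:
#   static_budget 128 0.88
#   static_budget 128 0.75
```

Under that assumption, 0.88 leaves roughly 15GB for everything else on the box, while 0.75 leaves 32GB.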

Then I tried speculative decoding flags:

docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --speculative-eagle-topk 4

But with SGLANG_ENABLE_SPEC_V2=True set, speculative decoding only supports topk=1. The mismatch caused silent failures and memory leaks. The container would run for a few minutes, then die.

Finally, --restart unless-stopped with --rm:

docker run --rm --restart unless-stopped --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4

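Docker’s documentation says these flags are incompatible, and the Docker CLI normally refuses the pair outright with a conflicting-options error. They ask for opposite things: --rm removes the container as soon as it exits, while a restart policy needs the container to stick around so it can be started again. Use one or the other, and if something else handles restarts, make sure it waits for memory first.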
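A cheap guard against this mistake is to check the argument list before it ever reaches Docker. The following is a sketch only: check_docker_flags is a made-up helper, it scans every argument (including anything after the image name), and it treats any --restart value as a conflict even though a policy of "no" would be harmless.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: fail if --rm and a restart policy are combined.
check_docker_flags() {
  local has_rm=0 has_restart=0 arg
  for arg in "$@"; do
    case "$arg" in
      --rm) has_rm=1 ;;
      --restart|--restart=*) has_restart=1 ;;
    esac
  done
  if [ "$has_rm" -eq 1 ] && [ "$has_restart" -eq 1 ]; then
    echo "conflict: --rm and --restart cannot be combined" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_docker_flags run --rm --restart unless-stopped --gpus all IMAGE \
#     || echo "fix the flags first"
```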


The Working SGLang Command Line

After all the failures, this configuration works:

# SGLang server flags go after the image name; docker run rejects unknown flags placed before it
docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm EAGLE \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.75 \
  --context-length 65536

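Key choices:

  • --attention-backend triton: avoids the extra 2GB that flashinfer allocates at initialization
  • --speculative-eagle-topk 1: the only value SGLANG_ENABLE_SPEC_V2=True supports
  • --mem-fraction-static 0.75: leaves the headroom that 0.88 did not

Note: This exact command assumes you’ve pre-downloaded the model to /models/mistral-small-4. Adjust paths accordingly.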


Adding Memory Wait Before Restart

The real fix wasn’t just the command line. It was waiting for memory to fully release.

I added a wait script at /data/scripts/sglang-wait-memory.sh:

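#!/bin/bash
# Block until available memory reaches TARGET_FREE GB, or give up after MAX_WAIT seconds.
TARGET_FREE=${1:-70}  # GB
MAX_WAIT=${2:-300}    # seconds

start=$(date +%s)
while true; do
  # Column 7 of the Mem: row in free output is "available"
  free_gb=$(free -g | awk '/^Mem:/ {print $7}')
  if [ "$free_gb" -ge "$TARGET_FREE" ]; then
    echo "Memory available: ${free_gb}GB"
    break
  fi
  elapsed=$(( $(date +%s) - start ))
  if [ "$elapsed" -ge "$MAX_WAIT" ]; then
    echo "Timeout waiting for memory" >&2
    exit 1  # a non-zero exit stops the caller from starting the container
  fi
  sleep 5
done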

Then modified the service to call it:

ExecStartPre=/data/scripts/sglang-wait-memory.sh 70 300
ExecStart=/usr/bin/docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm EAGLE \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.75 \
  --context-length 65536

The systemd service now waits up to 5 minutes for memory to stabilize before starting the container.


Watch Out: These Landmines Will Bite You

Gotcha: nvidia-gpu-reset.target doesn’t exist on DGX Spark/GB10. I tried adding:

After=nvidia-gpu-reset.target

to the service file. Systemd ignored it. No error, just no effect. The target simply isn’t present on ARM64 GPUs. Remove it before you waste hours debugging.

Warning: Don’t set RestartSec too low. I started with 30 seconds:

RestartSec=30

But unified memory on ARM64 can take 60+ seconds to fully release. With 30 seconds, the container would start before memory was ready, triggering OOM again. Bump it to 60 seconds minimum.
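Rather than guessing at a RestartSec value, you can measure how long the release actually takes on your machine. The sketch below polls MemAvailable from /proc/meminfo (the same figure free reports as "available") once per second and prints the elapsed seconds when the target is reached. The function name and the optional third argument, a path to a meminfo-style file for trying the parser offline, are my own additions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many seconds pass until MemAvailable >= target.
# Returns 1 if max_wait seconds elapse first.
seconds_until_available() {
  local target_gb="$1" max_wait="${2:-300}" meminfo="${3:-/proc/meminfo}"
  local start now avail
  start=$(date +%s)
  while :; do
    # MemAvailable is reported in kB; 1048576 kB = 1 GiB (rounded down)
    avail=$(awk '/^MemAvailable:/ { printf "%d", $2 / 1048576 }' "$meminfo")
    now=$(date +%s)
    if [ "$avail" -ge "$target_gb" ]; then
      echo $(( now - start ))
      return 0
    fi
    if [ $(( now - start )) -ge "$max_wait" ]; then
      return 1
    fi
    sleep 1
  done
}

# Usage:
#   docker kill sglang-mistral4 && seconds_until_available 70 300
```

Run it a few times after a kill and set RestartSec comfortably above the worst number you see.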

Note: StartLimitBurst=3 with StartLimitIntervalSec=600 prevents restart loops. Without it, systemd would keep restarting the service every 30 seconds, each attempt failing and eating memory until the system collapsed. Set both explicitly (on current systemd versions they belong in the [Unit] section):

StartLimitBurst=3
StartLimitIntervalSec=600

The Only Reliable Restart Sequence

Never restart SGLang immediately after killing it. Always follow this sequence:

docker kill sglang-mistral4
# Wait for memory to free
/data/scripts/sglang-wait-memory.sh 70 300
# Then start
docker run --rm --gpus all \
  sglang/sglang:v0.3.0-mistral-small-4 \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm EAGLE \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.75 \
  --context-length 65536

Automate it with a wrapper script if you restart often. Manual restarts are the fastest way to hit OOM.
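One possible shape for that wrapper is sketched below. The function name safe_restart and the DOCKER and WAIT_SCRIPT override variables are my own inventions, added so the flow can be dry-run without touching a real container; the wait script path and its 70/300 arguments come from the sequence above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: kill the container, wait for memory, then run the start command.
# DOCKER and WAIT_SCRIPT can be overridden for a dry run.
DOCKER="${DOCKER:-docker}"
WAIT_SCRIPT="${WAIT_SCRIPT:-/data/scripts/sglang-wait-memory.sh}"

safe_restart() {
  local name="$1"; shift
  # A container that is already gone is not an error here.
  "$DOCKER" kill "$name" >/dev/null 2>&1 || true
  # Abort if memory never frees up; starting anyway would just OOM again.
  if ! "$WAIT_SCRIPT" 70 300; then
    echo "memory did not free in time; not starting ${name}" >&2
    return 1
  fi
  # Everything after the container name is the start command, run as given.
  "$@"
}

# Usage:
#   safe_restart sglang-mistral4 docker run --rm --gpus all IMAGE ...
```

Passing the start command as arguments keeps the wrapper independent of the exact flag set, so the same function works if the command line changes later.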


What I Actually Use

  • Mistral Small 4: The only model that fits in unified memory without constant crashes
  • DGX Spark with GB10: ARM64 GPU that needs patience for memory release
  • Triton backend: Saves 2GB per restart compared to flashinfer

[Illustration: SGLang Restart OOM Fix: Unified Memory Pitfalls on ARM64 GPUs. A stack of five numbered layers: 1. Hardware: DGX Spark/GB10 ARM64 GPUs; 2. OS: unified memory release delay; 3. Docker: restart vs --rm conflict; 4. Model: 90GB Mistral Small 4; 5. Fix: Triton + FlashInfer Cutlass]