Three Silent Failures That Would Have Killed My Self-Hosted AI Stack
SSH silently broke on reboot because I pasted two config directives onto one line. The port went dark. The kernel pushed pages to swap until the system slowed to a crawl. A container starved itself of memory. None of these threw alarms. Here’s how I found them, and what I changed to keep my stack alive.
Quick Take
- SSH refused connections after a reboot due to a one-line config error
- Swap was in use on a machine with 128 GB of RAM when it should have stayed idle
- OpenHands ran with an 8 GB memory cap while everything else ran wild
SSH Syntax Error That Locked Me Out
The system rebooted cleanly. SSH refused connections. Port 2222 showed no listener.
systemctl status ssh showed:
● ssh.service - OpenBSD Secure Shell server
Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2024-01-01 03:14:56 UTC; 2min ago
Docs: man:sshd(8)
man:sshd_config(5)
Process: 421 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=255)
The error in /var/log/syslog:
sshd[421]: /etc/ssh/sshd_config.d/timeout.conf line 1: garbage at end of line; "ClientAliveCountMax 2".
The config file had both directives mashed together:
ClientAliveInterval 900 ClientAliveCountMax 2
Fixed with:
printf "ClientAliveInterval 900\nClientAliveCountMax 2\n" > /etc/ssh/sshd_config.d/timeout.conf
systemctl restart ssh
SSH came back on port 2222. Lesson: never paste config lines without checking line breaks, and run sshd -t before restarting the service.
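A minimal sketch of that lesson, assuming root privileges and that sshd and systemctl are on PATH. The helper name apply_sshd_dropin is hypothetical, not part of OpenSSH. It writes the new drop-in, validates the whole config with sshd -t, and rolls back if validation fails, so a broken file never survives until the next reboot.

```shell
# Hypothetical helper: install an sshd drop-in only if the full config validates.
# New file content is read from stdin; $1 is the drop-in path.
apply_sshd_dropin() {
    target="$1"
    backup="${target}.bak"
    had_old=0
    if [ -f "$target" ]; then
        cp -p "$target" "$backup"   # keep the previous version for rollback
        had_old=1
    fi
    cat > "$target"                 # write the new content from stdin
    if sshd -t; then                # same check the unit runs in ExecStartPre
        systemctl restart ssh
        return 0
    fi
    echo "sshd -t rejected the new config; rolling back $target" >&2
    if [ "$had_old" -eq 1 ]; then
        mv "$backup" "$target"
    else
        rm -f "$target"
    fi
    return 1
}
```

Usage would look like: printf 'ClientAliveInterval 900\nClientAliveCountMax 2\n' | apply_sshd_dropin /etc/ssh/sshd_config.d/timeout.conf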
Swap Eating RAM on a 128 GB Unified Memory System
free -h showed 4.3 GB in swap despite 128 GB of RAM. The system lagged under load.
cat /proc/sys/vm/swappiness returned 60, the kernel default, which lets the kernel swap out idle pages fairly readily.
Added /etc/sysctl.d/99-sovereign.conf:
vm.swappiness=1
vm.vfs_cache_pressure=50
Applied with:
sysctl -p /etc/sysctl.d/99-sovereign.conf
free -h now shows swap at 0. With vfs_cache_pressure at 50, the kernel also holds on to dentry and inode caches longer. The system stays responsive under load.
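A small sketch for confirming that the running kernel actually holds the values from the sysctl file, rather than trusting that sysctl -p took effect. The function name verify_sysctl_file and its optional second argument (an alternate root for /proc/sys, useful for dry runs) are assumptions for illustration. It handles only simple single-value key=value lines like the two above.

```shell
# Hypothetical probe: compare key=value lines in a sysctl file against live values.
# $1 = sysctl conf file, $2 = root of the sysctl tree (defaults to /proc/sys).
verify_sysctl_file() {
    conf="$1"
    proc_root="${2:-/proc/sys}"
    status=0
    while IFS='=' read -r key want || [ -n "$key" ]; do
        # Strip spaces and tabs so "vm.swappiness = 1" also parses.
        key=$(printf '%s' "$key" | tr -d ' \t')
        want=$(printf '%s' "$want" | tr -d ' \t')
        case "$key" in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        # vm.swappiness maps to <proc_root>/vm/swappiness
        have=$(cat "$proc_root/$(printf '%s' "$key" | tr . /)" 2>/dev/null)
        if [ "$have" != "$want" ]; then
            echo "$key: want $want, have ${have:-unreadable}"
            status=1
        fi
    done < "$conf"
    return "$status"
}
```

Running verify_sysctl_file /etc/sysctl.d/99-sovereign.conf prints nothing when both values match. Note that lowering vm.swappiness does not by itself pull already-swapped pages back into RAM. Cycling swap with swapoff -a followed by swapon -a is one way to do that, provided free RAM exceeds the swapped amount.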
Postfix Running When It Shouldn’t
systemctl status postfix showed the unit in a failed state. /etc/postfix/main.cf was missing, and the logs filled with errors.
Removed it:
apt remove --purge postfix bsd-mailx -y
No email needed on this box. Fewer services mean fewer failure points.
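One way to spot units in this state without checking them one by one is to ask systemd for everything currently failed. A minimal sketch, with failed_units as a hypothetical helper name:

```shell
# Hypothetical helper: print the name of every systemd unit in the failed state.
failed_units() {
    # --plain drops the status bullet, --no-legend drops header and footer,
    # so the unit name is always the first column.
    systemctl list-units --state=failed --plain --no-legend | awk '{print $1}'
}
```

Empty output means nothing is failed. Any unit name that appears is worth either fixing or removing.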
OpenHands Memory Limit Strangling Performance
The OpenHands container ran with --memory=8g while every other container ran unbounded, and the AI stack slowed under load.
Stopped and removed the container:
docker stop openhands && docker rm openhands
Recreated without the limit:
docker run -d \
--name openhands \
--restart unless-stopped \
--network config_default \
--add-host host.docker.internal:host-gateway \
-p 127.0.0.1:3001:3000 \
-e LLM_MODEL=openai/Mistral-Small-4 \
-e LLM_BASE_URL=http://host.docker.internal:30000/v1 \
-e LLM_API_KEY=not-needed-local \
-e LLM_DROP_PARAMS=true \
-e LLM_NATIVE_TOOL_CALLING=true \
-e LLM_DISABLE_VISION=true \
-e LLM_CACHING_PROMPT=false \
-e LITELLM_LOG=DEBUG \
-e OPENHANDS_TELEMETRY=false \
-e WORKSPACE_BASE=/data/projects \
-e INIT_GIT_IN_EMPTY_WORKSPACE=1 \
-v /data/projects:/opt/workspace_base:rw \
-v /data/openhands-state:/.openhands:rw \
-v /data/openhands-state/config.toml:/app/config.toml:ro \
-v /data/openhands-state/patches/agent_controller.py:/app/openhands/controller/agent_controller.py:ro \
-v /data/openhands-state/.gitconfig:/root/.gitconfig:ro \
-v /data/secrets/git-credentials:/root/.git-credentials:ro \
-v /data/projects/shared:/shared:ro \
-v /var/run/docker.sock:/var/run/docker.sock:rw \
ghcr.io/all-hands-ai/openhands:latest
The container’s memory limit now shows 0, which Docker treats as unlimited. The AI stack breathes again.
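To see at a glance which running containers carry a memory cap, the limit can be read from each container's HostConfig.Memory field, which is in bytes and is 0 when no limit is set. A sketch, with container_memory_limits as a hypothetical helper name:

```shell
# Hypothetical helper: list each running container with its memory limit.
container_memory_limits() {
    ids=$(docker ps -q)
    [ -n "$ids" ] || return 0       # nothing running, nothing to report
    # $ids is intentionally unquoted so each container ID becomes its own argument.
    docker inspect --format '{{.Name}} {{.HostConfig.Memory}}' $ids |
    awk '{
        sub(/^\//, "", $1)          # docker inspect prefixes names with "/"
        if ($2 == 0) print $1 ": unlimited"
        else printf "%s: %.1f GiB\n", $1, $2 / 1073741824
    }'
}
```

A container started with --memory=8g would be reported at 8.0 GiB, which makes an outlier like the old OpenHands container easy to spot next to unbounded neighbours.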
Docker Cleanup Removed Dead Weight
Pruned stopped containers:
docker container prune -f
Deleted duplicate image tag:
docker rmi ghcr.io/all-hands-ai/runtime:oh_v0.59.0_1z87fcmwpofr5a4i
Left vllm-node:latest for future LLM workloads.
What I Actually Use
- Mistral Small 4: local model for OpenHands agent work
- OpenHands: the agent framework running unconstrained
- DGX Spark: ARM64 server with 128 GB unified memory
What goes on the daily-check list now
The five failures in this post share one structural property: each was silent until something else broke. The SSH lockout only surfaced at the next reboot; swap was slowing inference invisibly until a profile run; postfix was a memory and attack-surface tax that nothing alerted on; OpenHands’ 8 GB memory limit looked like a model performance issue; Docker bloat was just a “df is full” surprise.
After this post a daily-check shell script runs on the DGX Spark:
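# /data/scripts/daily-check.sh
# Prints only on anomaly. Run as root so sshd -t can read the host keys.
sshd -t || echo "SSH config broken"
# swapon --show exits 0 even with no swap, so report used swap from /proc/meminfo instead.
awk '$1=="SwapTotal:"{t=$2} $1=="SwapFree:"{f=$2} END{if (t-f > 0) print "swap in use: " (t-f) " kB"}' /proc/meminfo
# --quiet suppresses the state string, leaving only the exit code.
systemctl is-enabled --quiet postfix.service 2>/dev/null && echo "postfix unexpectedly enabled"
# Match ": 0B" rather than "0B" so sizes such as 10B are not filtered out.
docker system df --format '{{.Type}}: {{.Reclaimable}}' | grep -v ': 0B'
df -h /data | awk 'NR==2 && +$5 > 80 {print "data partition >80% full"}'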
It runs from a systemd timer at 06:00 daily and writes to the journal only on anomaly. If a run logs nothing, everything is fine. If something is wrong, it shows up before the day’s work begins, not at the moment that work would have collided with the broken state.
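The post does not show the timer itself, so the following is only a sketch of what such a setup could look like. The unit names daily-check.service and daily-check.timer, the helper name write_daily_check_units, and the Persistent=true setting are all assumptions. The script is invoked through /bin/sh because the listing above has no shebang line.

```shell
# Hypothetical helper: write a oneshot service and a 06:00 timer for the daily check.
# $1 = directory for the unit files (defaults to /etc/systemd/system).
write_daily_check_units() {
    unit_dir="${1:-/etc/systemd/system}"
    cat > "$unit_dir/daily-check.service" <<'EOF'
[Unit]
Description=Daily configuration-state probe

[Service]
Type=oneshot
ExecStart=/bin/sh /data/scripts/daily-check.sh
EOF
    cat > "$unit_dir/daily-check.timer" <<'EOF'
[Unit]
Description=Run daily-check.sh at 06:00

[Timer]
OnCalendar=*-*-* 06:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After writing the units, systemctl daemon-reload followed by systemctl enable --now daily-check.timer would activate the schedule, and journalctl -u daily-check.service shows whatever the script printed. Persistent=true makes systemd run a missed check at the next boot, which suits a box that is sometimes powered off at 06:00.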
The lesson worth generalizing: silent failures need active probes, not passive monitoring. Each of the five problems in this post could have raised a Prometheus alert had a metric for it existed and been scraped, but they were configuration-state failures rather than runtime-metric failures, so in practice only an active probe surfaces them. Daily-check shell scripts are unfashionable, but they catch this category of problem in a way that traditional observability tooling does not.
The economy of this approach is in the failure response. When the daily check fires an anomaly, the action is short and rehearsed: SSH config issue → revert from /etc/ssh/sshd_config.bak (kept from the last known-good state). Postfix re-enabled → systemctl disable --now postfix plus a rebuild check. Disk pressure → docker system prune -a plus journalctl --vacuum-time=14d. None of these are clever; all of them are faster than discovering the problem at the moment of next outage.