Three Silent Failures That Would Have Killed My Self-Hosted AI Stack
SSH broke on reboot because I had pasted two config directives onto one line, and the port went dark. The kernel pushed pages into swap while RAM sat free, and the system slowed to a crawl. One container ran under a memory cap that starved it. None of these raised an alarm. Here’s how I found them and what I changed to keep my stack alive.
Quick Take
- SSH refused connections after a reboot due to a one-line config error
- Swap was active on 128 GB RAM when it should have stayed idle
- OpenHands ran with an 8 GB memory cap while everything else ran wild
SSH Syntax Error That Locked Me Out
The system rebooted cleanly. SSH refused connections. Port 2222 showed no listener.
systemctl status ssh showed:
● ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2024-01-01 03:14:56 UTC; 2min ago
       Docs: man:sshd(8)
             man:sshd_config(5)
    Process: 421 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=255)
The error in /var/log/syslog:
sshd[421]: /etc/ssh/sshd_config.d/timeout.conf line 1: garbage at end of line; "ClientAliveCountMax 2".
The config file had both directives mashed together:
ClientAliveInterval 900 ClientAliveCountMax 2
Fixed with:
printf "ClientAliveInterval 900\nClientAliveCountMax 2\n" > /etc/ssh/sshd_config.d/timeout.conf
systemctl restart ssh
SSH came back on port 2222. Lesson: never paste config lines without checking the line breaks, and validate with sshd -t before the next restart.
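A check along these lines could catch mashed-together directives before a reboot does. It is a minimal sketch, not part of the original fix: the function name check_single_arg_directives and the short list of single-argument directives are assumptions for illustration, and the heuristic only flags extra tokens after those listed directives.

```shell
#!/bin/sh
# Hypothetical helper: flag lines where a directive that takes exactly one
# argument is followed by extra tokens (for example, a second directive
# pasted onto the same line). The directive list is illustrative only.
check_single_arg_directives() {
    awk '
        BEGIN {
            single["clientaliveinterval"] = 1
            single["clientalivecountmax"] = 1
        }
        /^[ \t]*(#|$)/ { next }         # skip comments and blank lines
        {
            key = tolower($1)           # sshd keywords are case-insensitive
            if ((key in single) && NF > 2) {
                printf "%s:%d: unexpected text after %s %s\n", FILENAME, FNR, $1, $2
                bad = 1
            }
        }
        END { exit bad + 0 }            # non-zero exit if anything was flagged
    ' "$1"
}

# Example (not run here): lint the drop-in, then let sshd validate the full
# configuration before restarting the service.
#   check_single_arg_directives /etc/ssh/sshd_config.d/timeout.conf \
#       && sshd -t && systemctl restart ssh
```

Chaining the lint, sshd -t, and the restart with && means a bad file stops the sequence while the running daemon is still up.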
Swap Eating RAM on a 128 GB Unified Memory System
free -h showed 4.3 GB in swap despite ample free RAM on a 128 GB system. The system lagged under load.
cat /proc/sys/vm/swappiness returned 60, the kernel default, which lets the kernel swap out idle pages well before RAM runs short.
Added /etc/sysctl.d/99-sovereign.conf:
vm.swappiness=1
vm.vfs_cache_pressure=50
Applied with:
sysctl -p /etc/sysctl.d/99-sovereign.conf
free -h now shows swap at 0. With vfs_cache_pressure at 50, the kernel keeps more directory and inode cache in RAM. The system stays responsive under load.
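Lowering vm.swappiness changes future swap decisions but does not by itself pull already-swapped pages back into RAM; those return as they are touched, or when swap is cycled. The sketch below assumes a hypothetical helper named swap_used_kib that reads the SwapTotal and SwapFree fields from /proc/meminfo. The cycling step is left as a comment because it needs root and enough free RAM to absorb the swapped pages.

```shell
#!/bin/sh
# Hypothetical helper: print swap in use, in KiB, from a meminfo-style file
# (defaults to /proc/meminfo).
swap_used_kib() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END { print total - free }
    ' "${1:-/proc/meminfo}"
}

# Example (not run here; requires root):
#   sysctl vm.swappiness            # confirm the new value is live
#   swap_used_kib                   # KiB still sitting in swap
#   swapoff -a && swapon -a         # move swapped pages back into RAM
```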
Postfix Installed When It Shouldn’t Be
systemctl status postfix showed the service in a failed state. /etc/postfix/main.cf was missing, and the logs filled with errors.
Removed it:
apt remove --purge postfix bsd-mailx -y
No email needed on this box. Fewer services mean fewer failure points.
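Since none of these failures raised an alarm, a periodic look at failed units is one cheap safeguard. The following is a sketch under assumptions: failed_units is a hypothetical helper that pulls unit names out of systemctl --failed output piped to it, and how the result gets reported (a log line, a cron job, a dashboard) is left open.

```shell
#!/bin/sh
# Hypothetical helper: read `systemctl --failed --no-legend --plain` output
# on stdin and print just the unit names (first column).
failed_units() {
    awk 'NF { print $1 }'
}

# Example (not run here):
#   systemctl --failed --no-legend --plain | failed_units
```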
OpenHands Memory Limit Strangling Performance
The OpenHands container ran with --memory=8g while every other container ran unbounded. The AI stack slowed under load.
Stopped and removed the container:
docker stop openhands && docker rm openhands
Recreated without the limit:
docker run -d \
  --name openhands \
  --restart unless-stopped \
  --network config_default \
  --add-host host.docker.internal:host-gateway \
  -p 127.0.0.1:3001:3000 \
  -e LLM_MODEL=openai/Mistral-Small-4 \
  -e LLM_BASE_URL=http://host.docker.internal:30000/v1 \
  -e LLM_API_KEY=not-needed-local \
  -e LLM_DROP_PARAMS=true \
  -e LLM_NATIVE_TOOL_CALLING=true \
  -e LLM_DISABLE_VISION=true \
  -e LLM_CACHING_PROMPT=false \
  -e LITELLM_LOG=DEBUG \
  -e OPENHANDS_TELEMETRY=false \
  -e WORKSPACE_BASE=/data/projects \
  -e INIT_GIT_IN_EMPTY_WORKSPACE=1 \
  -v /data/projects:/opt/workspace_base:rw \
  -v /data/openhands-state:/.openhands:rw \
  -v /data/openhands-state/config.toml:/app/config.toml:ro \
  -v /data/openhands-state/patches/agent_controller.py:/app/openhands/controller/agent_controller.py:ro \
  -v /data/openhands-state/.gitconfig:/root/.gitconfig:ro \
  -v /data/secrets/git-credentials:/root/.git-credentials:ro \
  -v /data/projects/shared:/shared:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:rw \
  ghcr.io/all-hands-ai/openhands:latest
The container’s memory limit now reads 0, which Docker treats as unlimited. The AI stack breathes again.
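One way to confirm the cap is gone is to read the container's HostConfig.Memory field, where Docker reports the limit in bytes and 0 means no limit. The helper name describe_mem_limit below is an assumption for illustration; it only turns that byte count into a readable label.

```shell
#!/bin/sh
# Hypothetical helper: turn a byte count from HostConfig.Memory into a
# readable label. Docker reports 0 when no memory limit is set.
describe_mem_limit() {
    bytes=$1
    if [ "$bytes" -eq 0 ]; then
        echo "unlimited"
    else
        # Integer division; limits that are not whole GiB round down.
        echo "$((bytes / 1073741824)) GiB"
    fi
}

# Example (not run here):
#   describe_mem_limit "$(docker inspect --format '{{.HostConfig.Memory}}' openhands)"
```

Run against the old container, the same call would have printed 8 GiB for the --memory=8g cap.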
Docker Cleanup Removed Dead Weight
Pruned stopped containers:
docker container prune -f
Deleted duplicate image tag:
docker rmi ghcr.io/all-hands-ai/runtime:oh_v0.59.0_1z87fcmwpofr5a4i
Left vllm-node:latest for future LLM workloads.
What I Actually Use
- Mistral Small 4: local model for OpenHands agent work
- OpenHands: the agent framework running unconstrained
- DGX Spark: ARM64 server with 128 GB unified memory