Three Silent Failures That Would Have Killed My Self-Hosted AI Stack

March 29, 2026

SSH silently broke on reboot because I pasted two config lines into one, and the port went dark. The kernel pushed pages to swap despite plenty of free RAM until the system slowed to a crawl. A container ran under a memory cap nobody had noticed. None of these raised an alarm. Here’s how I found them, and what I changed to keep my stack alive.

Quick Take

SSH refused connections after a reboot due to a one-line config error

Swap was active on 128 GB RAM when it should have stayed idle

OpenHands ran with an 8 GB memory cap while everything else ran wild

SSH Syntax Error That Locked Me Out

The system rebooted cleanly. SSH refused connections. Port 2222 showed no listener.

systemctl status ssh showed:

● ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2024-01-01 03:14:56 UTC; 2min ago
       Docs: man:sshd(8)
             man:sshd_config(5)
    Process: 421 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=255)

The error in /var/log/syslog:

sshd[421]: /etc/ssh/sshd_config.d/timeout.conf line 1: garbage at end of line; "ClientAliveCountMax 2".

The config file had both directives mashed together:

ClientAliveInterval 900 ClientAliveCountMax 2

Fixed with:

printf "ClientAliveInterval 900\nClientAliveCountMax 2\n" > /etc/ssh/sshd_config.d/timeout.conf
systemctl restart ssh

SSH came back on port 2222. Lesson: never paste config lines without checking the line breaks, and validate with sshd -t before restarting.
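The "validate before restart" habit can be scripted. What follows is a minimal sketch, not what ran on this box: the function name install_dropin and the VALIDATE variable are hypothetical. It writes the new drop-in, runs the validator (sshd -t by default), and restores the previous file if validation fails.

```shell
#!/bin/sh
# Sketch: install a drop-in only if the validator accepts the resulting config.
# VALIDATE is a hypothetical knob; it defaults to "sshd -t" and can be
# overridden to dry-run the logic without touching a real sshd.
VALIDATE="${VALIDATE:-sshd -t}"

install_dropin() {
    target="$1"            # e.g. /etc/ssh/sshd_config.d/timeout.conf
    backup="$target.bak"
    had_old=0
    if [ -f "$target" ]; then
        cp -p "$target" "$backup" && had_old=1
    fi
    # New content arrives on stdin, one directive per line.
    cat > "$target"
    if $VALIDATE; then
        rm -f "$backup"
        return 0
    fi
    # Validation failed: put the previous state back.
    if [ "$had_old" -eq 1 ]; then
        mv "$backup" "$target"
    else
        rm -f "$target"
    fi
    echo "rejected $target: validator failed" >&2
    return 1
}
```

Usage would look like: printf 'ClientAliveInterval 900\nClientAliveCountMax 2\n' | install_dropin /etc/ssh/sshd_config.d/timeout.conf && systemctl restart ssh. The restart only happens if the config parses, so a fused line never reaches a reboot.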

Swap Eating RAM on a 128 GB Unified Memory System

free -h showed 4.3 GB in swap on a machine with 128 GB of RAM. The system lagged under load.

cat /proc/sys/vm/swappiness returned 60, the kernel default, which lets the kernel swap out idle pages even when RAM is plentiful.

Added /etc/sysctl.d/99-sovereign.conf:

vm.swappiness=1
vm.vfs_cache_pressure=50

Applied with:

sysctl -p /etc/sysctl.d/99-sovereign.conf

free -h now shows swap at 0. Filesystem cache holds more data in RAM. The system stays responsive under load.
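To check the result without reading free -h by eye, swap usage can be computed from /proc/meminfo. This is a small sketch; the helper name swap_used_kb is hypothetical. One caveat worth stating: lowering vm.swappiness does not by itself pull already-swapped pages back into RAM. They return when touched, or when swap is cycled with swapoff -a followed by swapon -a.

```shell
#!/bin/sh
# Print swap currently in use, in kB, from a meminfo-style file.
# Defaults to /proc/meminfo; a path argument allows testing with a fixture.
swap_used_kb() {
    awk '/^SwapTotal:/ {total = $2}
         /^SwapFree:/  {free = $2}
         END           {print total - free}' "${1:-/proc/meminfo}"
}
```

A result of 0 means nothing is swapped out. Anything else is worth a look at which process owns the pages before cycling swap.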

Postfix Running When It Shouldn’t

systemctl status postfix showed a failed unit. /etc/postfix/main.cf was missing, and the logs filled with errors.

Removed it:

apt remove --purge postfix bsd-mailx -y

No email needed on this box. Fewer services mean fewer failure points.

OpenHands Memory Limit Strangling Performance

OpenHands container ran with --memory=8g. All other containers ran unbounded. The AI stack slowed under load.

Stopped and removed the container:

docker stop openhands && docker rm openhands

Recreated without the limit:

docker run -d \
  --name openhands \
  --restart unless-stopped \
  --network config_default \
  --add-host host.docker.internal:host-gateway \
  -p 127.0.0.1:3001:3000 \
  -e LLM_MODEL=openai/Mistral-Small-4 \
  -e LLM_BASE_URL=http://host.docker.internal:30000/v1 \
  -e LLM_API_KEY=not-needed-local \
  -e LLM_DROP_PARAMS=true \
  -e LLM_NATIVE_TOOL_CALLING=true \
  -e LLM_DISABLE_VISION=true \
  -e LLM_CACHING_PROMPT=false \
  -e LITELLM_LOG=DEBUG \
  -e OPENHANDS_TELEMETRY=false \
  -e WORKSPACE_BASE=/data/projects \
  -e INIT_GIT_IN_EMPTY_WORKSPACE=1 \
  -v /data/projects:/opt/workspace_base:rw \
  -v /data/openhands-state:/.openhands:rw \
  -v /data/openhands-state/config.toml:/app/config.toml:ro \
  -v /data/openhands-state/patches/agent_controller.py:/app/openhands/controller/agent_controller.py:ro \
  -v /data/openhands-state/.gitconfig:/root/.gitconfig:ro \
  -v /data/secrets/git-credentials:/root/.git-credentials:ro \
  -v /data/projects/shared:/shared:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:rw \
  ghcr.io/all-hands-ai/openhands:latest

The container's memory limit now shows 0, which Docker treats as unlimited. The AI stack breathes again.
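A forgotten cap like this is easy to audit across all running containers. The sketch below assumes the input is "name bytes" pairs as printed by docker inspect with the HostConfig.Memory field, where 0 means no limit. The function name capped_containers is hypothetical.

```shell
#!/bin/sh
# Read "name bytes" pairs on stdin; print only containers that have a memory cap.
capped_containers() {
    awk '$2 > 0 { printf "%s capped at %.1f GiB\n", $1, $2 / 1073741824 }'
}

# Live usage (requires Docker; HostConfig.Memory is in bytes, 0 = no limit):
#   docker ps -q | xargs -r docker inspect \
#     --format '{{.Name}} {{.HostConfig.Memory}}' | capped_containers
```

No output means every running container is unbounded, which is what this stack expects.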

Docker Cleanup Removed Dead Weight

Pruned stopped containers:

docker container prune -f

Deleted duplicate image tag:

docker rmi ghcr.io/all-hands-ai/runtime:oh_v0.59.0_1z87fcmwpofr5a4i

Left vllm-node:latest for future LLM workloads.

What I Actually Use

Mistral Small 4: local model for OpenHands agent work

OpenHands: the agent framework running unconstrained

DGX Spark: ARM64 server with 128 GB unified memory

What Goes on the Daily-Check List Now

The five failures in this post share one structural property: each was silent until something else broke. The SSH lockout only surfaced at the next reboot; swap slowed inference invisibly until a profile run; postfix was a memory and attack-surface tax whose errors nobody was reading; OpenHands’ 8 GB memory limit looked like a model performance issue; Docker bloat was a “df is full” surprise waiting to happen.

Since these fixes, a daily-check shell script runs on the DGX Spark:

# /data/scripts/daily-check.sh (run as root; prints only when something is wrong)
sshd -t || echo "SSH config broken"
awk '/^SwapTotal:/{t=$2} /^SwapFree:/{f=$2} END{if (t-f > 0) print "swap in use: " (t-f) " kB"}' /proc/meminfo
systemctl is-enabled --quiet postfix.service 2>/dev/null && echo "postfix unexpectedly enabled"
docker system df --format '{{.Type}}: {{.Reclaimable}}' | grep -v ': 0B'
df -h /data | awk 'NR==2 && +$5 > 80 {print "data partition >80% full"}'

It runs from a systemd timer at 06:00 daily and writes only on anomaly into the journal. If the journal entry says nothing, everything is fine. If something is wrong it shows up before the day’s work begins, not at the moment that work would have collided with the broken state.
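The timer wiring might look like the sketch below. The unit names daily-check.service and daily-check.timer and the helper write_daily_check_units are assumptions, not taken from the actual machine. The helper writes both unit files into a directory given as its argument, normally /etc/systemd/system. A service's stdout goes to the journal by default, so the script's echo lines land there without extra plumbing.

```shell
#!/bin/sh
# Sketch: write a oneshot service plus a daily 06:00 timer into the given directory.
# Unit names are hypothetical; the script path matches the one used above.
write_daily_check_units() {
    dir="$1"   # normally /etc/systemd/system
    cat > "$dir/daily-check.service" <<'EOF'
[Unit]
Description=Daily silent-failure probe

[Service]
Type=oneshot
ExecStart=/bin/sh /data/scripts/daily-check.sh
EOF
    cat > "$dir/daily-check.timer" <<'EOF'
[Unit]
Description=Run daily-check.sh at 06:00

[Timer]
OnCalendar=*-*-* 06:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After writing the files, systemctl daemon-reload followed by systemctl enable --now daily-check.timer activates the schedule. Persistent=true makes systemd run a missed check at next boot, which matters on a box whose failures tend to show up after reboots.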

The lesson worth generalizing: silent failures need active probes, not passive monitoring. The five things in this post would each have produced a Prometheus alert if a real metric had been scraped, but since they were configuration-state failures rather than runtime-metric failures, only an active probe surfaces them. Daily-check shell scripts are unfashionable but they catch this category of problem in a way that traditional observability tooling does not.

The economy of this approach is in the failure response. When the daily check fires an anomaly, the action is short and rehearsed: SSH config issue → revert from /etc/ssh/sshd_config.bak (kept from the last known-good state). Postfix re-enabled → systemctl disable --now postfix plus a rebuild check. Disk pressure → docker system prune -a plus journalctl --vacuum-time=14d. None of these are clever; all of them are faster than discovering the problem at the moment of next outage.

Flow: Silent Failures Debugged (diagnosing and fixing self-hosted AI stack issues)

SSH Config Error: pasted config lines broke SSH on reboot

Swap Misconfiguration: high swappiness slowed the system under load

Unnecessary Service: a broken postfix install added bloat

Container Limits: a memory cap throttled AI performance
