How a single SSH syntax error, misconfigured swappiness, and container limits almost took down my Sovereign AI stack, and the exact commands I used to fix them.

Three Silent Failures That Would Have Killed My Self-Hosted AI Stack

SSH silently broke on reboot because I pasted two config directives onto one line, and the port went dark. The kernel pushed memory out to swap until the system slowed to a crawl, even with RAM to spare. A container ran starved under its own memory cap. None of these raised an alarm. Here’s how I found them, and what I changed to keep my stack alive.

Quick Take

  • SSH refused connections after a reboot due to a one-line config error
  • Swap was active on 128 GB RAM when it should have stayed idle
  • OpenHands ran with an 8 GB memory cap while everything else ran wild

SSH Syntax Error That Locked Me Out

The system rebooted cleanly. SSH refused connections. Port 2222 showed no listener.
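A minimal sketch of one way to confirm a missing listener from the local console. It assumes the ss tool from iproute2, whose listening-socket output puts the local address in the fourth column; port_listening is a hypothetical helper name, not a standard command.

```shell
#!/bin/sh
# port_listening is a hypothetical helper name.
# Reads `ss -ltn` style output on stdin and succeeds if any local address ends in :<port>.
port_listening() {
    awk -v p="$1" '$4 ~ (":" p "$") { found = 1 } END { exit !found }'
}

# Example:
#   ss -ltn | port_listening 2222 && echo "sshd is listening" || echo "no listener on 2222"
```

Matching on the end of the address field keeps port 22 from being mistaken for 2222.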

systemctl status ssh showed:

● ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2024-01-01 03:14:56 UTC; 2min ago
       Docs: man:sshd(8)
             man:sshd_config(5)
    Process: 421 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=255)

The error in /var/log/syslog:

sshd[421]: /etc/ssh/sshd_config.d/timeout.conf line 1: garbage at end of line; "ClientAliveCountMax 2".

The config file had both directives mashed together:

ClientAliveInterval 900 ClientAliveCountMax 2

Fixed with:

printf "ClientAliveInterval 900\nClientAliveCountMax 2\n" > /etc/ssh/sshd_config.d/timeout.conf
systemctl restart ssh

SSH came back on port 2222. Lesson: never paste config lines without checking the line breaks, and run sshd -t before the next reboot.
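The lesson can be turned into a guard, sketched here under a few assumptions: install_dropin is a hypothetical helper (not part of OpenSSH), and the default validator is /usr/sbin/sshd -t, the same check the unit's ExecStartPre runs. It copies a new drop-in into place, validates the whole config, and restores the previous file if validation fails.

```shell
#!/bin/sh
# install_dropin is a hypothetical helper, not part of OpenSSH.
# Usage: install_dropin <new-file> <target-path> [validator command...]
install_dropin() {
    src=$1
    dst=$2
    shift 2
    # Default validator: sshd's own config test.
    if [ "$#" -eq 0 ]; then
        set -- /usr/sbin/sshd -t
    fi

    # Keep a copy of the current file so a bad edit can be undone.
    backup=""
    if [ -f "$dst" ]; then
        backup=$(mktemp) || return 1
        cp "$dst" "$backup" || return 1
    fi

    cp "$src" "$dst" || return 1

    if "$@"; then
        if [ -n "$backup" ]; then rm -f "$backup"; fi
        return 0
    fi

    # Validation failed: restore the previous file, or remove the new one.
    if [ -n "$backup" ]; then
        cp "$backup" "$dst" && rm -f "$backup"
    else
        rm -f "$dst"
    fi
    return 1
}

# Example (as root):
#   printf 'ClientAliveInterval 900\nClientAliveCountMax 2\n' > /tmp/timeout.conf
#   install_dropin /tmp/timeout.conf /etc/ssh/sshd_config.d/timeout.conf && systemctl restart ssh
```

Because the restart only runs when validation passes, a fused line like the one above would be caught while the current SSH session is still open.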


Swap Eating RAM on a 128 GB Unified Memory System

free -h showed 4.3 GB in swap on a machine with 128 GB of RAM and plenty of it free. The system lagged under load.

cat /proc/sys/vm/swappiness returned 60, the kernel default. At that setting the kernel swaps out idle pages well before RAM runs short.

Added /etc/sysctl.d/99-sovereign.conf:

vm.swappiness=1
vm.vfs_cache_pressure=50

Applied with:

sysctl -p /etc/sysctl.d/99-sovereign.conf

free -h now shows swap at 0. Filesystem cache holds more data in RAM. The system stays responsive under load.
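A small sketch for checking the result without reading free -h by eye. It assumes the standard SwapTotal and SwapFree fields in /proc/meminfo; swap_used_kb is a hypothetical helper name. The comment about cycling swap reflects general kernel behavior and is not a step confirmed by the write-up above.

```shell
#!/bin/sh
# swap_used_kb is a hypothetical helper name.
# Prints swap in use (kB) from a meminfo-style file; defaults to /proc/meminfo.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { print total - free }' "${1:-/proc/meminfo}"
}

# Example:
#   echo "swappiness: $(cat /proc/sys/vm/swappiness), swap used: $(swap_used_kb) kB"
#
# Lowering vm.swappiness does not by itself pull already-swapped pages back.
# Cycling swap forces them into RAM, assuming enough free memory (run as root):
#   swapoff -a && swapon -a
```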


Postfix Running When It Shouldn’t

systemctl status postfix showed a failed unit. /etc/postfix/main.cf was missing, and the logs were filling with errors.

Removed it:

apt remove --purge postfix bsd-mailx -y

No email needed on this box. Fewer services mean fewer failure points.


OpenHands Memory Limit Strangling Performance

OpenHands container ran with --memory=8g. All other containers ran unbounded. The AI stack slowed under load.

Stopped and removed the container:

docker stop openhands && docker rm openhands

Recreated without the limit:

docker run -d \
  --name openhands \
  --restart unless-stopped \
  --network config_default \
  --add-host host.docker.internal:host-gateway \
  -p 127.0.0.1:3001:3000 \
  -e LLM_MODEL=openai/Mistral-Small-4 \
  -e LLM_BASE_URL=http://host.docker.internal:30000/v1 \
  -e LLM_API_KEY=not-needed-local \
  -e LLM_DROP_PARAMS=true \
  -e LLM_NATIVE_TOOL_CALLING=true \
  -e LLM_DISABLE_VISION=true \
  -e LLM_CACHING_PROMPT=false \
  -e LITELLM_LOG=DEBUG \
  -e OPENHANDS_TELEMETRY=false \
  -e WORKSPACE_BASE=/data/projects \
  -e INIT_GIT_IN_EMPTY_WORKSPACE=1 \
  -v /data/projects:/opt/workspace_base:rw \
  -v /data/openhands-state:/.openhands:rw \
  -v /data/openhands-state/config.toml:/app/config.toml:ro \
  -v /data/openhands-state/patches/agent_controller.py:/app/openhands/controller/agent_controller.py:ro \
  -v /data/openhands-state/.gitconfig:/root/.gitconfig:ro \
  -v /data/secrets/git-credentials:/root/.git-credentials:ro \
  -v /data/projects/shared:/shared:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:rw \
  ghcr.io/all-hands-ai/openhands:latest

The container’s memory limit now shows 0, which in Docker means no limit. The AI stack breathes again.
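For reference, a hedged sketch of reading that value. Docker reports the limit in bytes under HostConfig.Memory in docker inspect output, with 0 meaning unlimited; describe_mem_limit is a hypothetical helper name that only formats the number.

```shell
#!/bin/sh
# describe_mem_limit is a hypothetical helper name.
# Takes a limit in bytes, as reported in HostConfig.Memory, and prints a readable form.
describe_mem_limit() {
    bytes=$1
    if [ "$bytes" -eq 0 ]; then
        echo "unlimited"
    else
        # Whole GiB, rounded down.
        echo "$((bytes / 1073741824)) GiB"
    fi
}

# Example:
#   describe_mem_limit "$(docker inspect --format '{{.HostConfig.Memory}}' openhands)"
```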


Docker Cleanup Removed Dead Weight

Pruned stopped containers:

docker container prune -f

Deleted duplicate image tag:

docker rmi ghcr.io/all-hands-ai/runtime:oh_v0.59.0_1z87fcmwpofr5a4i

Left vllm-node:latest for future LLM workloads.


What I Actually Use

  • Mistral Small 4: local model for OpenHands agent work
  • OpenHands: the agent framework running unconstrained
  • DGX Spark: ARM64 server with 128 GB unified memory

Illustration: Three Silent Failures That Would Have Killed My Self-Hosted AI Stack (flow diagram "Silent Failures Debugged": 1. SSH config error, 2. swap, 3. unnecessary service, 4. container limits)