Monitoring one VPS with a Prometheus stack is like hiring a security team for a garden shed. I wrote a 315-line bash script instead: one SSH session, twelve checks, one morning notification. Here is the design, the honest comparison against the usual suspects, and why detect-and-alert beats auto-fix at this scale.

vps-healthcheck: Twelve Daily Checks, One SSH Session, One Notification

Running a single VPS for a personal project puts you in an awkward spot with monitoring. A full Prometheus + Alertmanager + Grafana stack is three services to operate, secure, and upgrade, which is itself a small fleet you now have to watch. Netdata wants a daemon on the monitored host with broad permissions. Uptime Kuma tells you when your site is down but says nothing about a disk quietly filling, an OOM kill at 3 a.m., or a cert that expires on Saturday. Monit is daemonized. Every one of these is more moving parts than the thing you are trying to watch.

I wrote a 315-line bash script instead. It SSHes into the box once a day, runs twelve checks in a single session, and sends one notification. That is the whole tool: vps-healthcheck, MIT, no dependencies beyond bash and SSH.

What it checks

#CheckAlert level
1SSH reachability + public URL returns 200critical if either fails
2Reboot-required flagwarn
3Pending Debian security updateswarn at 1, critical at 5
4Failed systemd unitscritical
5UFW activecritical if inactive
6Fail2ban currently-banned (sshd jail)informational
7Disk usage on / and /boot/efiwarn at 85%, critical at 95%
8Memory availablewarn below 200 MiB
9OOM kills in the last 24hcritical
10Let’s Encrypt cert expiry per domainwarn at 14d, critical at 3d
11Expected docker containers all Upcritical on missing
12Optional freshness file mtimewarn if stale beyond 26h

The selection is deliberately opinionated. These twelve are the failure classes that external uptime monitoring cannot see and that actually rot a small server: disks, memory pressure, dead units, stale certs, a firewall someone disabled “temporarily”. Check 12 is the generic hook: point it at the output file of any nightly job and the healthcheck doubles as a dead-man switch for your cron jobs.

How a check actually works

Take the cert-expiry probe, check 10, because it is the one most worth getting right. It opens a TLS connection to each domain, reads the certificate’s expiry date, and converts it to days remaining. It warns at 14 days and goes critical at 3 days. That matters because Let’s Encrypt certificates last 90 days and renew automatically, which means the failure mode is silent: renewal breaks quietly and you find out only when a visitor hits the browser error, often weeks later. A 14-day warning turns that silent decay into a calm Tuesday-morning fix rather than a Saturday-night outage.

The disk probe, check 7, is the same shape. It reads usage on / and /boot/efi, warns at 85%, and goes critical at 95%. The 95% line exists because a full root filesystem does not just stop new writes, it can wedge the package manager and journald, i.e. it takes out the tools you would use to recover. Catching it at 85% buys you days of runway instead of an emergency. Every threshold in the script is a number you can argue with and change in one place.

Setup in three steps

git clone https://github.com/cipherfoxie/vps-healthcheck.git
cd vps-healthcheck

# configure via env or a conf file alongside the script
export HC_SSH_HOST=my-vps
export HC_REACH_URL=https://my-domain.example/
export HC_CERT_DOMAINS="my-domain.example"
export HC_EXPECTED_CONTAINERS="caddy app"
export HC_NOTIFY="/path/to/my-notify-script"   # optional

# test without pushing a notification
bash vps-healthcheck.sh --dry-run

# cron, daily at 07:30
# 30 7 * * * /usr/bin/bash /path/to/vps-healthcheck.sh --quiet >> /var/log/vps-healthcheck.log 2>&1

Exit codes are boring on purpose: 0 = clean, 1 = warnings, 2 = critical. Cron mails on non-zero by default, so even with no notify script configured you get a usable signal.

HC_NOTIFY is called as "$HC_NOTIFY" <title> <body>. Anything that accepts two string arguments works: a Matrix bridge, an ntfy.sh curl one-liner, a Telegram bot, plain notify-send. The script does not know or care what is behind it.

Why not use an existing tool?

The honest comparison, because every tool in this table is good at what it is actually for:

ToolBest atWhy it did not fit
Prometheus + Alertmanager + GrafanaFleet observability, PromQL, multi-day trendsThree services to run, maintain, and monitor themselves
NetdataReal-time dashboards, subsecond resolutionDaemon on the host, broad permissions, web UI attack surface
Uptime KumaExternal uptime and status pagesCannot see disk, memory, OOM, or cert state from inside the box
MonitService-level restarts, long-history graphsDaemonized on the host
Healthchecks.ioPush-based cron heartbeatsMonitors jobs, not host health

vps-healthcheck occupies one specific position: you already trust your SSH access, you refuse to run a daemon for this, and what you want is morning-coffee confidence. It is agentless by design rather than agent-based like Netdata, and pull-from-outside rather than push-from-inside like Healthchecks.io, because for a box you already SSH into daily the extra daemon buys nothing and adds attack surface. One VPS, maybe two. Beyond that, run Prometheus and do it properly: this tool is a deliberate floor, not a ceiling.

The single-session design

The script issues exactly one ssh call per run. Inside that session, a heredoc performs all twelve probes and prints a structured key=value block; the local side parses it and makes every alerting decision without a second round-trip. This is not cleverness, it is failure-surface math: each additional SSH session is another thing that can hang, time out, or half-succeed. On a morning check you want exactly one network-layer failure mode, and check 1 already covers it.

The trade-off is that every check lives inside the constraint of one heredoc, i.e. a block of commands piped to the remote shell in a single connection. That keeps the whole thing readable in one sitting and trivially auditable, but a misbehaving probe delays the rest. The connect timeout is 15 seconds and SSH runs with BatchMode on, which means it never blocks waiting for a password prompt: if key auth fails it errors out immediately rather than hanging the morning run. The individual probes are all sub-second commands, so a healthy run finishes in roughly two seconds rather than the minutes a multi-session approach would spend reconnecting.

There is no auto-fix, also on purpose. A script that restarts services on its own is a script that can mask a real problem at 4 a.m. and turn a clean postmortem into archaeology. Detect, aggregate, notify once, let the human decide. The same philosophy drove watchdocker, this tool’s sibling: smallest possible surface, zero resident processes, full audit in one read.

Do I run it myself?

Yes. The reference deployment runs daily at 07:30 via cron, from my workstation against the VPS that hosts this blog, with a Matrix push as the notify script. It has been in daily production since the VPS was hardened, and the morning digest is exactly one message: green and one line, or yellow/red and a reason. That is the entire user experience, and that is the point.

Limitations

The script runs from your workstation, so if your workstation is off, the check does not run that day. For the primary use case (a morning cron on a machine you use daily) that is acceptable; as an always-on external probe it is the wrong tool, and you should pair it with a dedicated uptime service for the outside view.

It assumes a Debian-family target (apt security updates, UFW, journalctl). Porting the heredoc to another distro is a half-hour job, but out of the box that is the supported surface.

And it monitors one host per invocation. Two VPSes means two cron lines. Ten means you have outgrown it.

Reproduce it

Repo: github.com/cipherfoxie/vps-healthcheck. MIT license, single file, CI runs shellcheck on every push. vps-healthcheck is one of the tools this stack releases back into the open; the full list lives on the upstream page.


Part of the sovereign-tools series: small, self-contained utilities for operators running their own infrastructure. Each one solves exactly one problem, reads in under twenty minutes, and has no runtime dependencies beyond what is already on the box. Previous: watchdocker, the bash-native watchtower successor. Follow via RSS or Nostr.

Was this worth it? Zap the article.

Value for value, no signup. Sats go straight to the writer.