vps-healthcheck: Twelve Daily Checks, One SSH Session, One Notification

June 10, 2026 7 min read

Running a single VPS for a personal project puts you in an awkward spot with monitoring. A full Prometheus + Alertmanager + Grafana stack is three services to operate, secure, and upgrade, which is itself a small fleet you now have to watch. Netdata wants a daemon on the monitored host with broad permissions. Uptime Kuma tells you when your site is down but says nothing about a disk quietly filling, an OOM kill at 3 a.m., or a cert that expires on Saturday. Monit is daemonized. Every one of these is more moving parts than the thing you are trying to watch.

I wrote a 315-line bash script instead. It SSHes into the box once a day, runs twelve checks in a single session, and sends one notification. That is the whole tool: vps-healthcheck, MIT, no dependencies beyond bash and SSH.

What it checks

#	Check	Alert level
1	SSH reachability + public URL returns 200	critical if either fails
2	Reboot-required flag	warn
3	Pending Debian security updates	warn at 1, critical at 5
4	Failed systemd units	critical
5	UFW active	critical if inactive
6	Fail2ban currently-banned (sshd jail)	informational
7	Disk usage on / and /boot/efi	warn at 85%, critical at 95%
8	Memory available	warn below 200 MiB
9	OOM kills in the last 24h	critical
10	Let’s Encrypt cert expiry per domain	warn at 14d, critical at 3d
11	Expected docker containers all Up	critical on missing
12	Optional freshness file mtime	warn if stale beyond 26h

The selection is deliberately opinionated. These twelve are the failure classes that external uptime monitoring cannot see and that actually rot a small server: disks, memory pressure, dead units, stale certs, a firewall someone disabled “temporarily”. Check 12 is the generic hook: point it at the output file of any nightly job and the healthcheck doubles as a dead-man switch for your cron jobs.

How a check actually works

Take the cert-expiry probe, check 10, because it is the one most worth getting right. It opens a TLS connection to each domain, reads the certificate’s expiry date, and converts it to days remaining. It warns at 14 days and goes critical at 3 days. That matters because Let’s Encrypt certificates last 90 days and renew automatically, which means the failure mode is silent: renewal breaks quietly and you find out only when a visitor hits the browser error, often weeks later. A 14-day warning turns that silent decay into a calm Tuesday-morning fix rather than a Saturday-night outage.

The disk probe, check 7, is the same shape. It reads usage on / and /boot/efi, warns at 85%, and goes critical at 95%. The 95% line exists because a full root filesystem does not just stop new writes, it can wedge the package manager and journald, i.e. it takes out the tools you would use to recover. Catching it at 85% buys you days of runway instead of an emergency. Every threshold in the script is a number you can argue with and change in one place.

Setup in three steps

git clone https://github.com/cipherfoxie/vps-healthcheck.git
cd vps-healthcheck

# configure via env or a conf file alongside the script
export HC_SSH_HOST=my-vps
export HC_REACH_URL=https://my-domain.example/
export HC_CERT_DOMAINS="my-domain.example"
export HC_EXPECTED_CONTAINERS="caddy app"
export HC_NOTIFY="/path/to/my-notify-script"   # optional

# test without pushing a notification
bash vps-healthcheck.sh --dry-run

# cron, daily at 07:30
# 30 7 * * * /usr/bin/bash /path/to/vps-healthcheck.sh --quiet >> /var/log/vps-healthcheck.log 2>&1

Exit codes are boring on purpose: 0 = clean, 1 = warnings, 2 = critical. Cron mails on non-zero by default, so even with no notify script configured you get a usable signal.

HC_NOTIFY is called as "$HC_NOTIFY" <title> <body>. Anything that accepts two string arguments works: a Matrix bridge, an ntfy.sh curl one-liner, a Telegram bot, plain notify-send. The script does not know or care what is behind it.

Why not use an existing tool?

The honest comparison, because every tool in this table is good at what it is actually for:

Tool	Best at	Why it did not fit
Prometheus + Alertmanager + Grafana	Fleet observability, PromQL, multi-day trends	Three services to run, maintain, and monitor themselves
Netdata	Real-time dashboards, subsecond resolution	Daemon on the host, broad permissions, web UI attack surface
Uptime Kuma	External uptime and status pages	Cannot see disk, memory, OOM, or cert state from inside the box
Monit	Service-level restarts, long-history graphs	Daemonized on the host
Healthchecks.io	Push-based cron heartbeats	Monitors jobs, not host health

vps-healthcheck occupies one specific position: you already trust your SSH access, you refuse to run a daemon for this, and what you want is morning-coffee confidence. It is agentless by design rather than agent-based like Netdata, and pull-from-outside rather than push-from-inside like Healthchecks.io, because for a box you already SSH into daily the extra daemon buys nothing and adds attack surface. One VPS, maybe two. Beyond that, run Prometheus and do it properly: this tool is a deliberate floor, not a ceiling.

The single-session design

The script issues exactly one ssh call per run. Inside that session, a heredoc performs all twelve probes and prints a structured key=value block; the local side parses it and makes every alerting decision without a second round-trip. This is not cleverness, it is failure-surface math: each additional SSH session is another thing that can hang, time out, or half-succeed. On a morning check you want exactly one network-layer failure mode, and check 1 already covers it.

The trade-off is that every check lives inside the constraint of one heredoc, i.e. a block of commands piped to the remote shell in a single connection. That keeps the whole thing readable in one sitting and trivially auditable, but a misbehaving probe delays the rest. The connect timeout is 15 seconds and SSH runs with BatchMode on, which means it never blocks waiting for a password prompt: if key auth fails it errors out immediately rather than hanging the morning run. The individual probes are all sub-second commands, so a healthy run finishes in roughly two seconds rather than the minutes a multi-session approach would spend reconnecting.

There is no auto-fix, also on purpose. A script that restarts services on its own is a script that can mask a real problem at 4 a.m. and turn a clean postmortem into archaeology. Detect, aggregate, notify once, let the human decide. The same philosophy drove watchdocker, this tool’s sibling: smallest possible surface, zero resident processes, full audit in one read.

Do I run it myself?

Yes. The reference deployment runs daily at 07:30 via cron, from my workstation against the VPS that hosts this blog, with a Matrix push as the notify script. It has been in daily production since the VPS was hardened, and the morning digest is exactly one message: green and one line, or yellow/red and a reason. That is the entire user experience, and that is the point.

Limitations

The script runs from your workstation, so if your workstation is off, the check does not run that day. For the primary use case (a morning cron on a machine you use daily) that is acceptable; as an always-on external probe it is the wrong tool, and you should pair it with a dedicated uptime service for the outside view.

It assumes a Debian-family target (apt security updates, UFW, journalctl). Porting the heredoc to another distro is a half-hour job, but out of the box that is the supported surface.

And it monitors one host per invocation. Two VPSes means two cron lines. Ten means you have outgrown it.

Reproduce it

Repo: github.com/cipherfoxie/vps-healthcheck. MIT license, single file, CI runs shellcheck on every push. vps-healthcheck is one of the tools this stack releases back into the open; the full list lives on the upstream page.

Part of the sovereign-tools series: small, self-contained utilities for operators running their own infrastructure. Each one solves exactly one problem, reads in under twenty minutes, and has no runtime dependencies beyond what is already on the box. Previous: watchdocker, the bash-native watchtower successor. Follow via RSS or Nostr.

	Today	7d	30d	All-time
Unique readers	—	—	—	—
Page views	—	—	—	—