Self-Hosted Observability for a One-Person AI Stack

May 20, 2026 12 min read

The observability stack that lets one operator sleep through the night: Prometheus for metrics, Grafana for dashboards, one phone number for alerts, and the discipline to never wire an alert that is not actionable within thirty minutes. Everything else goes to a dashboard the operator checks once a day.

Quick Take

Layer 1: metrics collection. Prometheus scraping the Spark, the management host, the Floki VPS, the Lightning node, and the MCP server. node_exporter plus per-service exporters.

Layer 2: dashboards. Grafana for the operator overview. The sovgrid customer-facing dashboard is a separate surface (Services: Sovereign Dashboard) and serves a different audience.

Layer 3: alerting. Alertmanager with no Slack in the loop: an SMS gateway to the operator’s phone for pages, email for low-priority notices, and one rule: actionable within 30 minutes or not an alert.

Layer 4: logs. journald on each host, sometimes shipped to a central host via systemd-journal-remote when the customer requires audit trails.

The discipline: every alert that fires and turns out not to be actionable gets retired or rephrased. Alert fatigue is the failure mode that breaks one-person observability.

What “actionable within thirty minutes” actually means

The single rule that separates a working one-person observability stack from a broken one: if an alert fires and there is no specific action the operator can take in the next thirty minutes that materially changes the outcome, the alert is the wrong shape.

Examples of right-shape alerts:

The DGX Spark’s inference service has been unhealthy for five minutes. Action: SSH in, run the recovery runbook.
The Lightning node’s channel balance has fallen below the rebalancing threshold. Action: trigger the rebalancing script or open a new channel.
Disk usage on the Spark is above 90 percent. Action: clear the model cache or move backups off-host.

Examples of wrong-shape alerts:

CPU temperature spiked briefly. Action: probably none, the cooling will handle it.
Network latency increased by 15 percent. Action: probably none, normal variation.
Memory usage above 75 percent. Action: probably none, the inference service is supposed to use memory.

The wrong-shape alerts produce operator fatigue. After the third 02:00 false alarm, the operator starts dismissing all alerts, and the next real one gets ignored too.

Layer 1: metrics collection

Prometheus runs on the management host, not on the Spark. The Spark exposes a node_exporter endpoint that Prometheus scrapes every fifteen seconds. Why Prometheus over a SaaS monitoring product? SaaS products bill by data volume or by seat, and a one-person stack running inference workloads can spike to thousands of samples per second during a benchmark run, which makes the bill unpredictable. Prometheus stores everything locally, costs nothing per metric, and the data never leaves the rack.

The scrape configuration for the core services looks like this (as of May 2026, tested against Prometheus 2.51):

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: node_spark
    static_configs:
      - targets: ['spark-host:9100']
  - job_name: node_mgmt
    static_configs:
      - targets: ['localhost:9100']
  - job_name: vllm_inference
    static_configs:
      - targets: ['spark-host:8000']
    metrics_path: /metrics
  - job_name: mcp_server
    static_configs:
      - targets: ['floki-vps:9091']

Port 9100 is node_exporter. Port 9090 is Prometheus itself. Port 3000 is Grafana. Port 8000 is the vLLM/SGLang metrics endpoint. Those four ports are the ones that matter for day-to-day operations.

Per-service exporters add the workload-specific metrics. The vLLM exporter publishes throughput, queue depth, and request error rates. The Alby Hub exporter publishes Lightning channel state and balance. The MCP server exposes a Prometheus endpoint with request count and latency histograms. The custom master.py dispatcher publishes its own metrics on a small HTTP endpoint that Prometheus scrapes.

vLLM’s OpenAI-compatible server publishes Prometheus-format metrics at /metrics out of the box; SGLang does so only when metrics are enabled at launch. On top of that endpoint I run a thin wrapper at /data/scripts/ops/vllm_exporter.py that polls the engine’s /metrics endpoint every 10 seconds and re-exposes the data in Prometheus format. The exporter watches three signals: vllm:num_requests_running, vllm:gpu_cache_usage_perc, and vllm:avg_generation_throughput_toks_per_s. (Metric names have changed between vLLM releases, so check them against the running version’s /metrics output.) When cache usage crosses 90 percent, the request queue starts to back up and latency spikes. That is the signal worth alerting on.
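
To make the three watched signals concrete, here is a minimal sketch of pulling them out of a Prometheus text-format payload. It is not the contents of vllm_exporter.py; the function name extract_signals, the SAMPLE payload, and its model_name label are hypothetical, and the parser only handles the common "name{labels} value" line shape.

```python
from typing import Dict, Iterable

# The three signals named in the text above.
WATCHED = (
    "vllm:num_requests_running",
    "vllm:gpu_cache_usage_perc",
    "vllm:avg_generation_throughput_toks_per_s",
)


def extract_signals(exposition: str, wanted: Iterable[str] = WATCHED) -> Dict[str, float]:
    """Return {metric_name: value} for the wanted metrics.

    Assumes a single-model server: if a metric appears with several
    label sets, the last one seen wins.
    """
    wanted_set = set(wanted)
    found: Dict[str, float] = {}
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            # Name is everything before the label set; value follows the last '}'.
            name, rest = line.split("{", 1)
            rest = rest.rsplit("}", 1)[1]
        else:
            name, _, rest = line.partition(" ")
        name = name.strip()
        if name not in wanted_set:
            continue
        fields = rest.split()
        if fields:
            found[name] = float(fields[0])
    return found


# Hypothetical payload for illustration only.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.93
vllm:avg_generation_throughput_toks_per_s{model_name="demo"} 41.5
process_cpu_seconds_total 12.7
"""

signals = extract_signals(SAMPLE)
cache_pressure = signals["vllm:gpu_cache_usage_perc"] > 0.90
```

In this sample the cache gauge sits above the 90 percent line, which is the condition described above as worth alerting on.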

The metric naming convention follows Prometheus’s own guide: lowercase with underscores, unit and counter suffixes in the name (_seconds, _bytes, _total), and labels rather than separate metric names to distinguish dimensions. (See the Prometheus naming guide for the canonical version.)

Retention is fourteen days in Prometheus’s local TSDB. Anything longer-term belongs in a downsampled archive (Thanos or VictoriaMetrics for the deep-history case), but for a one-operator stack at sovgrid scale fourteen days is enough: the window covers two full weekly maintenance cycles, which is what it takes to correlate incidents with deployment changes.
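
A rough way to sanity-check what fourteen days costs in disk is the sizing rule from the Prometheus storage documentation: retention seconds times ingested samples per second times bytes per sample, with one to two bytes per sample as the usual compressed figure. The sketch below applies it; the 5,000 active series figure is a made-up placeholder, not a measurement from this stack.

```python
def tsdb_disk_bytes(
    active_series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 byte range
) -> float:
    """Estimate local TSDB size: retention * ingest rate * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical series count; 15s interval and 14 days come from the article.
estimate = tsdb_disk_bytes(active_series=5_000, scrape_interval_s=15, retention_days=14)
estimate_gb = estimate / 1e9
```

With those placeholder inputs the estimate lands under one gigabyte, so disk is unlikely to be the constraint that forces a shorter window at this scale.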

Layer 2: dashboards

Grafana has two dashboards that matter.

The operator-overview dashboard shows the state of the whole stack on one page. Top row: inference throughput, error rate, queue depth. Second row: Spark CPU, GPU, memory, disk. Third row: Floki VPS, MCP server, Lightning node. Fourth row: alert state and any known incidents.

The core panel for inference throughput looks like this in Grafana’s panel JSON (tested on Grafana 10.2, as of May 2026):

{
  "title": "Inference Throughput (tok/s)",
  "type": "timeseries",
  "targets": [
    {
      "expr": "vllm:avg_generation_throughput_toks_per_s",
      "legendFormat": "tokens/s"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "steps": [
          { "color": "red",    "value": 0  },
          { "color": "yellow", "value": 20 },
          { "color": "green",  "value": 50 }
        ]
      }
    }
  }
}

Those threshold values (20 tok/s yellow, 50 tok/s green) are calibrated for Qwen3.6 on the DGX Spark. A different model on different hardware will need different numbers. The thresholds are the part that requires iteration; the panel structure is reusable.

The operator-overview is the page the operator opens once a day to confirm the stack is healthy. If everything is green and the time range looks normal, the day’s observability work is done. If something is yellow, the operator investigates; if red, the alert should have fired.

The customer-facing dashboard (sovgrid dashboard, see Services: Sovereign Dashboard) is a separate Grafana instance with different access. Customers see the metrics that prove their workload is healthy on their hardware. They do not see the operator’s internal metrics.

The separation matters. The operator-facing dashboard can show “Spark CPU at 78 percent” because the operator knows the context. The customer-facing dashboard shows “Your workload completed in 1.2 seconds, well within the 5-second SLA” because the customer needs the answer, not the raw signal.

Layer 3: alerting

Alertmanager runs on the management host and routes alerts via two channels.

Channel 1: the operator’s phone, via SMS gateway. One channel, one phone number, hard-rate-limited to 4 alerts per hour to prevent runaway alerting. The SMS gateway is a small VPS-hosted relay (the sovgrid stack uses a custom service running on Floki, not a commercial SMS provider, for cost and sovereignty reasons). Why no PagerDuty or OpsGenie? Because those services require sending alert content to a third-party cloud, which is incompatible with the sovereignty constraints of the stack. Customer workload metadata should not leave the operator’s infrastructure. A self-hosted SMS relay via Floki keeps the alert pipeline on owned hardware.

Channel 2: the operator’s email inbox. Lower-priority alerts (warnings, daily summaries) go here. The email is parseable by the operator’s existing mail client. No special tool required.

The two channels match the two priority levels: phone for “you need to handle this in the next thirty minutes,” email for “you should know about this when you next check email.”
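
The 4-alerts-per-hour cap on the phone channel can be sketched as a sliding-window limiter. This is a minimal illustration, not the relay that runs on Floki: the class name HourlyAlertLimiter and its parameters are hypothetical, and where in the pipeline the cap is actually enforced is an assumption.

```python
import time
from collections import deque
from typing import Deque, Optional


class HourlyAlertLimiter:
    """Allow at most max_alerts sends within any rolling window."""

    def __init__(self, max_alerts: int = 4, window_s: float = 3600.0) -> None:
        self.max_alerts = max_alerts
        self.window_s = window_s
        self._sent: Deque[float] = deque()  # timestamps of allowed sends

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and record the send if the cap has room, else False."""
        now = time.monotonic() if now is None else now
        # Drop send times that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) >= self.max_alerts:
            return False
        self._sent.append(now)
        return True


limiter = HourlyAlertLimiter()
# Five alerts one minute apart: the fifth is over the cap.
burst = [limiter.allow(now=t) for t in (0, 60, 120, 180, 240)]
# One hour after the first send, a slot has freed up again.
after_an_hour = limiter.allow(now=3600)
```

A suppressed page should not vanish silently; routing it to the email channel instead is one reasonable choice, so the operator still sees that the cap was hit.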

An example alert rule for inference engine health looks like this:

# /etc/prometheus/rules/inference.yml
groups:
  - name: inference
    rules:
      - alert: InferenceEngineDown
        expr: up{job="vllm_inference"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Inference engine unreachable for 5 minutes"
          action: "SSH to spark-host, run: systemctl status vllm.service && journalctl -u vllm.service -n 50"
      - alert: GPUCacheNearFull
        expr: vllm:gpu_cache_usage_perc > 0.90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "GPU KV cache above 90 percent"
          action: "Reduce max_num_seqs or restart with smaller context window"

Every rule carries a concrete action: annotation. That discipline is the difference between an alert that the operator knows how to handle and one that gets silenced after the third 02:00 page.

Alert rules are documented in version-controlled YAML in the prometheus-alerts/ directory of the ops repository. Every rule has a comment block explaining the action the operator should take when it fires. If the comment is empty or says “investigate,” the rule is the wrong shape and should be retired or rephrased.
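
The "empty or investigate means wrong shape" check is mechanical enough to automate. The sketch below works on a rule file that has already been parsed into a dict (loading the YAML itself would need a parser such as PyYAML, which is not shown). The function name wrong_shape_rules, the VAGUE_ACTIONS set, and the HighLatency rule are hypothetical; it checks the action annotation used in the example rule file rather than comment blocks.

```python
from typing import Any, Dict, List

# Action texts that do not tell the operator what to do (hypothetical list).
VAGUE_ACTIONS = {"", "investigate", "tbd", "todo"}


def wrong_shape_rules(rule_file: Dict[str, Any]) -> List[str]:
    """Return names of alert rules whose 'action' annotation is missing or vague."""
    offenders: List[str] = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules carry 'record', not 'alert'
            annotations = rule.get("annotations") or {}
            action = str(annotations.get("action", ""))
            if action.strip().lower().rstrip(".") in VAGUE_ACTIONS:
                offenders.append(name)
    return offenders


# Parsed form of a rule file: one rule from the article, one hypothetical bad rule.
parsed_rules = {
    "groups": [
        {
            "name": "inference",
            "rules": [
                {
                    "alert": "InferenceEngineDown",
                    "expr": 'up{job="vllm_inference"} == 0',
                    "for": "5m",
                    "annotations": {
                        "summary": "Inference engine unreachable for 5 minutes",
                        "action": "SSH to spark-host, run: systemctl status vllm.service",
                    },
                },
                {
                    "alert": "HighLatency",  # hypothetical wrong-shape rule
                    "expr": "some_latency_metric > 1",
                    "annotations": {"action": "Investigate."},
                },
            ],
        }
    ]
}

offenders = wrong_shape_rules(parsed_rules)
```

Run against the ops repository in a pre-commit hook or CI job, a check like this keeps a vague rule from reaching the phone channel in the first place.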

Layer 4: logs

journald on each host is the default log destination. Most services write to the journal via stdout/stderr (the systemd default), and journalctl is the operator’s primary log-querying tool. Why log-based metrics over custom instrumentation at the start? Because adding a custom Prometheus counter to every service requires changing every service, and journald is already there. The cost of starting with log-based metrics is near zero, and the operator can query incidents from day one, before any custom exporter is written.

For multi-host log queries, systemd-journal-remote on the management host receives journal entries from the Spark (typically pushed by systemd-journal-upload on the sending side). The management host then has a record of what happened across the stack, indexed by time and host.

A useful pattern for querying inference engine errors across a time window:

# Query inference errors from the last 2 hours for both engine units.
# "--priority err" matches err and everything more severe (crit, alert, emerg).
journalctl --since "2 hours ago" \
  --unit vllm.service \
  --unit sglang.service \
  --priority err \
  --output cat \
  | grep -E "(ERROR|OOM|CUDA|failed|exit code)"
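
A similar query can be turned into a rough log-based metric by asking journalctl for JSON output and counting entries per unit. The sketch below is an illustration, not a script from the ops repository: the function names are hypothetical, and the sample lines stand in for real journal output so the counting logic can be read without a live host.

```python
import json
from collections import Counter
from typing import Iterable, List


def journalctl_argv(units: Iterable[str], since: str = "2 hours ago") -> List[str]:
    """Build a journalctl command line that emits one JSON object per entry."""
    argv = [
        "journalctl", "--since", since,
        "--priority", "err",  # err and everything more severe
        "--output", "json", "--no-pager",
    ]
    for unit in units:
        argv += ["--unit", unit]
    return argv


def errors_per_unit(json_lines: Iterable[str]) -> Counter:
    """Count journal entries per systemd unit from JSON-lines output."""
    counts: Counter = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        counts[entry.get("_SYSTEMD_UNIT", "unknown")] += 1
    return counts


# On a real host the lines would come from something like:
#   subprocess.run(journalctl_argv(["vllm.service", "sglang.service"]),
#                  capture_output=True, text=True, check=True).stdout.splitlines()
# Hypothetical sample entries for illustration:
sample_lines = [
    '{"_SYSTEMD_UNIT": "vllm.service", "PRIORITY": "3", "MESSAGE": "CUDA out of memory"}',
    '{"_SYSTEMD_UNIT": "vllm.service", "PRIORITY": "3", "MESSAGE": "worker failed"}',
    '{"_SYSTEMD_UNIT": "sglang.service", "PRIORITY": "2", "MESSAGE": "exit code 1"}',
]
counts = errors_per_unit(sample_lines)
```

JSON output keeps the unit and priority fields that "--output cat" throws away, which is what makes a per-unit count possible.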

For Grafana log panels (via Loki or direct journald), the same pattern applies: filter by priority level, not by log volume. High-volume info logs from the inference engine (token generation traces) go to a separate log stream that is not aggregated to the central host. Only error-level and above crosses the wire, so the central host stays fast and the operator does not drown in throughput logs when debugging a real incident.

For customer engagements with audit-trail requirements, the journal can be shipped to an additional log-aggregation host on the customer’s premises. (See Sovereign AI Healthcare: GDPR / HIPAA / DGX Spark, publication pending, for the audit-trail requirements that drive this.) The journal format is enough to satisfy most compliance audit trails when paired with tamper-evident sealing: journald’s Forward Secure Sealing, set up with journalctl --setup-keys and checked with journalctl --verify.

Why run the observability services on the management host (a mini-PC) rather than on the DGX Spark? Because the Spark’s 128 GB of unified memory is better reserved for model weights, and Prometheus and Grafana are light enough for a small box (the Prometheus process uses under 500 MB RAM even with 14 days of retention at this scrape volume). The management host runs all observability tooling: Prometheus on port 9090, Grafana on port 3000, Alertmanager on port 9093, and the SMS relay. That separation also means the observability stack stays up when the Spark is rebooted for a model swap, which happens several times a week.

Caveats: what one-person observability does not cover

Three limits worth naming before treating this stack as complete.

Caveat 1: coverage blind spots. This stack monitors what it knows to monitor. If a service is not scraped, it is invisible. The first sign of a missing exporter is usually a customer report, not an alert. As of May 2026, the sovgrid stack has no exporter for the ComfyUI service and no scrape target for the backup verification job. Those are known gaps, not covered by the current alert rules.

Caveat 2: the staleness problem. One-person observability decays when the operator is busy. Alert rules do not update themselves. A threshold calibrated for a 7B-parameter model is wrong for a 32B model loaded three months later. The dashboard can show “healthy” for a configuration that has drifted significantly from the state when the alert thresholds were written. Scheduled reviews every six to eight weeks are the only mitigation.

Caveat 3: alert fatigue is the ceiling. The 4-alerts-per-hour rate limit caps volume on the SMS channel, but it does not prevent fatigue. Once the operator starts dismissing alerts without reading them, the entire observability investment is worthless. I hit this twice in the first three months of operation. Both times the fix was retiring three or four rules, not adding new ones. Alert discipline is harder than metric collection.

The first six weeks of an observability stack

Operator-overview dashboards take roughly six weeks to stabilize. The first version of the dashboard will alert on the wrong things and miss the right things. The right shape emerges from iteration.

Week 1: install Prometheus and Grafana, write the first dashboard, get the basic node_exporter metrics flowing. Most metrics are noise; most alerts will be wrong.

Week 2: add the per-service exporters. The inference engine’s throughput and error rates are now visible. The alert rules start to look right but still fire on benign variations.

Week 3: the first real incident happens. The dashboard either showed the right signal (success) or did not (rework). The alert either fired (success) or did not (urgent rework). Update the dashboard and the alert rule based on the incident.

Weeks 4-6: iterate. Each false alarm gets the alert rephrased or retired. Each missed incident gets a new rule added. By week six the false-alarm rate is low enough that the operator trusts the alert channel, which is the precondition for the whole stack to work.

Skipping weeks 4-6 is the most common observability failure mode. The dashboard is “done” at week 3 and the operator stops iterating; the rules stay miscalibrated, the alerts get ignored, the real incident slips through. Six weeks of iteration is the minimum.
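
The iteration in weeks 4-6 is easier to sustain with a small tally of which rules keep firing without leading to an action. The sketch below assumes the operator keeps a hand-maintained log of (rule name, was it actionable) pairs; that log, the function name review_rules, the rule names in the sample, and both thresholds are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def review_rules(
    firings: Iterable[Tuple[str, bool]],
    min_firings: int = 3,          # ignore rules with too little history
    max_false_rate: float = 0.5,   # flag rules that are wrong more often than right
) -> List[str]:
    """Return rule names that are candidates for retiring or rephrasing."""
    stats: Dict[str, Tuple[int, int]] = {}  # rule -> (total firings, false alarms)
    for rule, actionable in firings:
        total, false_alarms = stats.get(rule, (0, 0))
        stats[rule] = (total + 1, false_alarms + (0 if actionable else 1))
    return sorted(
        rule
        for rule, (total, false_alarms) in stats.items()
        if total >= min_firings and false_alarms / total > max_false_rate
    )


# Hypothetical firing log: (rule name, did the operator have something to do?)
firing_log = [
    ("CpuTempSpike", False),
    ("CpuTempSpike", False),
    ("CpuTempSpike", False),
    ("InferenceEngineDown", True),
    ("DiskAbove90", True),
    ("DiskAbove90", False),
]
candidates = review_rules(firing_log)
```

Here only the rule that fired three times with nothing to do is flagged; the others either led to an action or have too little history to judge.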

Where this fits

For the broader operational context, see The Sovereign AI Stack in 2026. For the systemd patterns that integrate with observability, see systemd Patterns for Self-Hosted AI Services. For the customer-facing dashboard architecture, see Services: Sovereign Dashboard and the multi-part Sovereign Dev Studio series starting at Setup: Sovereign Dev Studio v2.2 Part 1.

The follow-up article publishes the actual Alertmanager rule set in use on the sovgrid stack, with each rule annotated with the action it triggers and the rationale for the threshold. Subscribe via the footer to catch it.
