Four Bugs That Only Showed Up Under Load: Fixing a FastAPI Dashboard
Quick Take
- Async FastAPI endpoints running synchronously under load
- Docker inspect calls multiplying like rabbits
- systemd’s ProtectSystem eating your AIDE updates
- Frontend polling requests piling up like unpaid invoices
The dashboard worked fine when I wasn’t using it. Then I fired up Locust and watched the whole thing collapse under 50 concurrent users. Four separate bugs, all invisible until the system was stressed. Here’s exactly what broke, why it broke, and how I fixed each one.
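For reference, the load test itself was nothing exotic. A minimal locustfile along these lines reproduces the polling pattern; the Authorization header and the token placeholder are assumptions, not the dashboard's actual auth scheme.
# locustfile.py -- minimal sketch of the load test; the Authorization
# header and token placeholder are assumptions, not the real auth.
from locust import HttpUser, between, task


class DashboardUser(HttpUser):
    wait_time = between(1, 5)  # each simulated user idles 1-5s between requests

    @task
    def poll_status(self):
        # the heavy endpoint that aggregates every system check
        self.client.get("/api/status", headers={"Authorization": "Bearer <token>"})
Ramping something like this to 50 concurrent users was enough to reproduce everything described below.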
Async Endpoint Blocking the Event Loop
The /api/status endpoint was marked async def, but every check inside it ran synchronously. Each subprocess.run() call blocked the uvicorn event loop outright. When du -sb /ai/models hit its 20-second timeout, no other request could be served, not even /api/aide/resolve.
@app.get("/api/status")
async def get_status(_: bool = Depends(verify_token)):
# This blocks the event loop for 20+ seconds
gpu = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
mem = parse_memory()
disk = parse_disk() # 20s timeout
# ...
The fix was straightforward: run blocking calls in executors and await them all in parallel.
@app.get("/api/status")
async def get_status(_: bool = Depends(verify_token)):
loop = asyncio.get_running_loop()
gpu, mem, disk, tor, ufw, containers, f2b, aide, butler_st, hidden, llm, wt = await asyncio.gather(
loop.run_in_executor(None, parse_nvidia_smi),
loop.run_in_executor(None, parse_memory),
loop.run_in_executor(None, parse_disk),
loop.run_in_executor(None, check_tor),
loop.run_in_executor(None, check_ufw),
loop.run_in_executor(None, check_containers),
loop.run_in_executor(None, check_fail2ban),
loop.run_in_executor(None, check_aide),
loop.run_in_executor(None, check_butler_status),
loop.run_in_executor(None, check_hidden_services),
loop.run_in_executor(None, check_llm_service),
loop.run_in_executor(None, check_watchtower),
)
return {
"gpu": gpu,
"memory": mem,
"disk": disk,
"tor": tor,
"firewall": ufw,
"containers": containers,
"fail2ban": f2b,
"aide": aide,
"butler": butler_st,
"hidden": hidden,
"llm": llm,
"watchtower": wt,
}
Now the response time equals the longest-running check instead of the sum of all checks. Under load, that's 2-3 seconds instead of 40.
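The speedup is easy to sanity-check outside the dashboard. A toy sketch (hypothetical, not project code) runs three artificially slow checks through the default executor and finishes in roughly the duration of the slowest one:
import asyncio
import time


def slow_check(seconds: float) -> float:
    # stand-in for a blocking subprocess call
    time.sleep(seconds)
    return seconds


async def main() -> None:
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    results = await asyncio.gather(
        loop.run_in_executor(None, slow_check, 1.0),
        loop.run_in_executor(None, slow_check, 2.0),
        loop.run_in_executor(None, slow_check, 3.0),
    )
    # wall time is ~3s (the slowest check), not 6s (the sum)
    print(results, f"elapsed {time.perf_counter() - start:.1f}s")


asyncio.run(main())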
N+1 Docker Calls in Container Status Checks
The check_containers() function called docker inspect once per container to check Watchtower labels. Ten containers meant eleven subprocesses just to build the status list. Under load, this multiplied latency by the number of containers.
for container in containers:
    inspect = subprocess.run(
        ["docker", "inspect", container, "--format",
         '{{index .Config.Labels "com.centurylinklabs.watchtower.enable"}}'],
        capture_output=True, text=True
    )
    if "true" in inspect.stdout:
        watchtower_containers.append(container)
I tried using --format with Go template indexing, but in docker ps templates .Labels is a plain string rather than a map, so index has nothing to index into.
# This fails with "cannot index slice/array with type string"
docker ps --format '...,"watchtower":"{{index .Labels "com.centurylinklabs.watchtower.enable"}}"...'
The real fix was two calls instead of N+1.
# Call 1: get all container names and statuses in one shot
raw = subprocess.run(
    ["docker", "ps", "-a", "--format",
     '{"name":"{{.Names}}","status":"{{.Status}}"...}'],
    capture_output=True, text=True
)
# docker prints one JSON object per line, so parse line by line
containers = [json.loads(line) for line in raw.stdout.splitlines() if line.strip()]

# Call 2: get only watchtower-enabled containers via filter
wt_raw = subprocess.run(
    ["docker", "ps", "-a",
     "--filter", "label=com.centurylinklabs.watchtower.enable=true",
     "--format", "{{.Names}}"],
    capture_output=True, text=True
)
wt_names = set(filter(None, wt_raw.stdout.strip().split("\n")))
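The join of the two results isn't shown above; a minimal sketch, assuming the JSON format string emits a name field for each container:
# annotate each container dict with its Watchtower flag
# (sketch; assumes the format string above emits a "name" field)
for c in containers:
    c["watchtower"] = c["name"] in wt_names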
Now container status updates take milliseconds instead of seconds.
Deprecated asyncio.get_event_loop() in Python 3.10+
The aide_resolve endpoint used the deprecated asyncio.get_event_loop() inside an async function. Python 3.10+ emits deprecation warnings for this pattern and will eventually turn them into errors, and calling loop.run_until_complete() on a loop that is already running fails outright.
# Old: deprecated, and run_until_complete() can't run on an already-running loop
loop = asyncio.get_event_loop()
result = loop.run_until_complete(asyncio.gather(...))
The fix is to use asyncio.get_running_loop() when already inside a coroutine and to await the gather directly.
# New: works everywhere
loop = asyncio.get_running_loop()
result = await asyncio.gather(...)
Two small lines changed, and the deprecation warnings are gone.
Missing Disk and Tor Exit IP Caches
parse_disk() ran du -sb /ai/models on every poll (every 5 seconds) and scanned 66 GB recursively. check_tor() rebuilt a Tor connection on every poll too. Both operations took up to 20 seconds, and since asyncio.gather() waits for all tasks, the response time was still dominated by the slowest check.
# Original: scans disk every 5 seconds
def parse_disk():
    cmd = ["du", "-sb", "/ai/models"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
I added simple TTL caches to avoid repeated work.
_disk_cache: dict | None = None
_disk_cache_ts: float = 0


def parse_disk() -> dict:
    global _disk_cache, _disk_cache_ts
    if time.time() - _disk_cache_ts < 60 and _disk_cache is not None:
        return _disk_cache
    cmd = ["du", "-sb", "/ai/models"]
    raw = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    _disk_cache = {"size": raw}
    _disk_cache_ts = time.time()
    return _disk_cache
The same pattern works for the Tor exit IP, but with a 120-second TTL, since the exit IP only rotates about every 10 minutes.
_tor_exit_ip: str | None = None
_tor_exit_ts: float = 0


def check_tor() -> dict:
    global _tor_exit_ip, _tor_exit_ts
    if time.time() - _tor_exit_ts < 120 and _tor_exit_ip is not None:
        return {"status": "active", "exit_ip": _tor_exit_ip}
    # ... torsocks curl ...
    _tor_exit_ip = exit_ip
    _tor_exit_ts = time.time()
    return {"status": "active", "exit_ip": exit_ip}
Disk polling dropped from 20 seconds to under 100ms. Tor connection reuse cut daily Tor traffic by 90%.
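Both caches have the same shape, so the pattern could be factored into a small TTL decorator. Here is a sketch of that refactor, assuming the cached functions take no arguments; the dashboard itself keeps the two explicit module-level caches shown above.
import functools
import subprocess
import time


def ttl_cache(seconds: float):
    # Cache a zero-argument function's result for `seconds` seconds.
    def decorator(fn):
        cached = None
        cached_at = 0.0

        @functools.wraps(fn)
        def wrapper():
            nonlocal cached, cached_at
            if cached is not None and time.time() - cached_at < seconds:
                return cached
            cached = fn()
            cached_at = time.time()
            return cached

        return wrapper

    return decorator


@ttl_cache(60)
def parse_disk() -> dict:
    raw = subprocess.run(["du", "-sb", "/ai/models"],
                         capture_output=True, text=True).stdout.strip()
    return {"size": raw}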
ProtectSystem Blocking AIDE Updates
The systemd unit for the dashboard service had ProtectSystem=strict, which made the entire filesystem read-only for child processes. Even though sudo /data/scripts/aide-resolve.sh worked in a terminal, it failed inside the service with EROFS: Read-only file system when writing /var/lib/aide/aide.db.new.
# Symptom: fails with EROFS even with sudo
sudo /data/scripts/aide-resolve.sh
# EROFS: Read-only file system
The fix was to explicitly allow the directories AIDE needs.
# /etc/systemd/system/grid-dashboard.service
[Service]
ReadWritePaths=/data /var/log /var/lib/tor /var/lib/aide
After reloading systemd and restarting the service, AIDE updates worked again.
sudo systemctl daemon-reload
sudo systemctl restart grid-dashboard
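One lesson from this bug is that the failure only surfaced when the script actually tried to write. A hypothetical pre-flight check in the endpoint can turn a sandboxing regression into a clear HTTP error instead of a shell-script failure; ensure_writable and the 503 response are illustrative choices, not the dashboard's actual code.
import os

from fastapi import HTTPException

AIDE_STATE_DIR = "/var/lib/aide"  # one of the paths opened up via ReadWritePaths


def ensure_writable(path: str) -> None:
    # hypothetical helper: fail fast with a clear message if systemd
    # sandboxing (ProtectSystem=strict without ReadWritePaths) left
    # the path read-only for this service
    if not os.access(path, os.W_OK):
        raise HTTPException(
            status_code=503,
            detail=f"{path} is not writable; check ReadWritePaths in the unit file",
        )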
Frontend Polling Stacking and Variable Shadowing
The dashboard frontend polled /api/status every 5 seconds. When the endpoint took longer than 5 seconds, requests stacked up, overwhelming the browser and API alike.
// Original: no guard, requests stack
setInterval(() => {
  fetch("/api/status").then(...);
}, 5000);
I added an in-flight guard using useRef(false) to prevent overlapping requests.
const fetchingRef = React.useRef(false);

const fetchStatus = async () => {
  if (fetchingRef.current) return;
  fetchingRef.current = true;
  try {
    const res = await fetch("/api/status");
    // ...
  } finally {
    fetchingRef.current = false;
  }
};
Also fixed a variable shadowing bug where sv in hs.filter(sv => sv.category === cat) shadowed sv = data?.services from the outer scope. Renamed the inner variable to svc.
What I Actually Use
- DGX Spark: runs the FastAPI dashboard and all AI services locally
- Mistral Small 4: the language model powering the dashboard’s AI features
- systemd: manages services with precise resource controls
Load-Induced Bugs: four issues exposed under stress and their fixes.