Four Bugs That Only Showed Up Under Load: Fixing a FastAPI Dashboard
Quick Take
- Async FastAPI endpoints running synchronously under load
- Docker inspect calls multiplying like rabbits
- systemd’s ProtectSystem eating your AIDE updates
- Frontend polling requests piling up like unpaid invoices
The dashboard worked fine when I wasn’t using it. Then I fired up Locust and watched the whole thing collapse under 50 concurrent users. Four separate bugs, all invisible until the system was stressed. Here’s exactly what broke, why it broke, and how I fixed each one.
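For reference, the load test itself was nothing exotic. A minimal locustfile along these lines reproduces the polling pattern; the Authorization header and the token placeholder are assumptions, not the dashboard's actual auth scheme.
# locustfile.py -- minimal sketch of the load test; the Authorization
# header and token placeholder are assumptions, not the real auth.
from locust import HttpUser, between, task


class DashboardUser(HttpUser):
    wait_time = between(1, 5)  # each simulated user idles 1-5s between requests

    @task
    def poll_status(self):
        # the heavy endpoint that aggregates every system check
        self.client.get("/api/status", headers={"Authorization": "Bearer <token>"})
Ramping something like this to 50 concurrent users was enough to reproduce everything described below.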
Async Endpoint Blocking the Event Loop
The /api/status endpoint was marked async def, but every check inside it ran synchronously. Each subprocess.run() call blocked the uvicorn event loop outright. When du -sb /ai/models hit its 20-second timeout, no other request could be served, not even /api/aide/resolve.
@app.get("/api/status")
async def get_status(_: bool = Depends(verify_token)):
# This blocks the event loop for 20+ seconds
gpu = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
mem = parse_memory()
disk = parse_disk() # 20s timeout
# ...
The fix was straightforward: run blocking calls in executors and await them all in parallel.
@app.get("/api/status")
async def get_status(_: bool = Depends(verify_token)):
loop = asyncio.get_running_loop()
gpu, mem, disk, tor, ufw, containers, f2b, aide, butler_st, hidden, llm, wt = await asyncio.gather(
loop.run_in_executor(None, parse_nvidia_smi),
loop.run_in_executor(None, parse_memory),
loop.run_in_executor(None, parse_disk),
loop.run_in_executor(None, check_tor),
loop.run_in_executor(None, check_ufw),
loop.run_in_executor(None, check_containers),
loop.run_in_executor(None, check_fail2ban),
loop.run_in_executor(None, check_aide),
loop.run_in_executor(None, check_butler_status),
loop.run_in_executor(None, check_hidden_services),
loop.run_in_executor(None, check_llm_service),
loop.run_in_executor(None, check_watchtower),
)
return {
"gpu": gpu,
"memory": mem,
"disk": disk,
"tor": tor,
"firewall": ufw,
"containers": containers,
"fail2ban": f2b,
"aide": aide,
"butler": butler_st,
"hidden": hidden,
"llm": llm,
"watchtower": wt,
}
Now the response time equals the longest-running check instead of the sum of all checks. Under load, that's 2-3 seconds instead of 40.
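The speedup is easy to sanity-check outside the dashboard. A toy sketch (hypothetical, not project code) runs three artificially slow checks through the default executor and finishes in roughly the duration of the slowest one:
import asyncio
import time


def slow_check(seconds: float) -> float:
    # stand-in for a blocking subprocess call
    time.sleep(seconds)
    return seconds


async def main() -> None:
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    results = await asyncio.gather(
        loop.run_in_executor(None, slow_check, 1.0),
        loop.run_in_executor(None, slow_check, 2.0),
        loop.run_in_executor(None, slow_check, 3.0),
    )
    # wall time is ~3s (the slowest check), not 6s (the sum)
    print(results, f"elapsed {time.perf_counter() - start:.1f}s")


asyncio.run(main())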
N+1 Docker Calls in Container Status Checks
The check_containers() function called docker inspect once per container to check Watchtower labels. Ten containers meant eleven subprocesses just to build the status list. Under load, this multiplied latency by the number of containers.
for container in containers:
    inspect = subprocess.run(
        ["docker", "inspect", container, "--format",
         '{{index .Config.Labels "com.centurylinklabs.watchtower.enable"}}'],
        capture_output=True, text=True
    )
    if "true" in inspect.stdout:
        watchtower_containers.append(container)
I tried using --format with Go template indexing, but in docker ps templates .Labels is a plain string rather than a map, so index has nothing to index into.
# This fails with "cannot index slice/array with type string"
docker ps --format '...,"watchtower":"{{index .Labels "com.centurylinklabs.watchtower.enable"}}"...'
The real fix was two calls instead of N+1.
# Call 1: get all container names and statuses in one shot
raw = subprocess.run(
    ["docker", "ps", "-a", "--format",
     '{"name":"{{.Names}}","status":"{{.Status}}"...}'],
    capture_output=True, text=True
)
# docker prints one JSON object per line, so parse line by line
containers = [json.loads(line) for line in raw.stdout.splitlines() if line.strip()]

# Call 2: get only watchtower-enabled containers via filter
wt_raw = subprocess.run(
    ["docker", "ps", "-a",
     "--filter", "label=com.centurylinklabs.watchtower.enable=true",
     "--format", "{{.Names}}"],
    capture_output=True, text=True
)
wt_names = set(filter(None, wt_raw.stdout.strip().split("\n")))
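The join of the two results isn't shown above; a minimal sketch, assuming the JSON format string emits a name field for each container:
# annotate each container dict with its Watchtower flag
# (sketch; assumes the format string above emits a "name" field)
for c in containers:
    c["watchtower"] = c["name"] in wt_names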
Now container status updates take milliseconds instead of seconds.
Deprecated asyncio.get_event_loop() in Python 3.10+
The aide_resolve endpoint used the deprecated asyncio.get_event_loop() inside an async function. Python 3.10+ emits deprecation warnings for this pattern and will eventually turn them into errors, and calling loop.run_until_complete() on a loop that is already running fails outright.
# Old: deprecated, and run_until_complete() can't run on an already-running loop
loop = asyncio.get_event_loop()
result = loop.run_until_complete(asyncio.gather(...))
The fix is to use asyncio.get_running_loop() when already inside a coroutine and to await the gather directly.
# New: works everywhere
loop = asyncio.get_running_loop()
result = await asyncio.gather(...)
Two small lines changed, and the deprecation warnings are gone.
Missing Disk and Tor Exit IP Caches
parse_disk() ran du -sb /ai/models on every poll (every 5 seconds) and scanned 66 GB recursively. check_tor() rebuilt a Tor connection on every poll too. Both operations took up to 20 seconds, and since asyncio.gather() waits for all tasks, the response time was still dominated by the slowest check.
# Original: scans disk every 5 seconds
def parse_disk():
    cmd = ["du", "-sb", "/ai/models"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
I added simple TTL caches to avoid repeated work.
_disk_cache: dict | None = None
_disk_cache_ts: float = 0


def parse_disk() -> dict:
    global _disk_cache, _disk_cache_ts
    if time.time() - _disk_cache_ts < 60 and _disk_cache is not None:
        return _disk_cache
    cmd = ["du", "-sb", "/ai/models"]
    raw = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    _disk_cache = {"size": raw}
    _disk_cache_ts = time.time()
    return _disk_cache
The same pattern works for the Tor exit IP, but with a 120-second TTL, since the exit IP only rotates about every 10 minutes.
_tor_exit_ip: str | None = None
_tor_exit_ts: float = 0


def check_tor() -> dict:
    global _tor_exit_ip, _tor_exit_ts
    if time.time() - _tor_exit_ts < 120 and _tor_exit_ip is not None:
        return {"status": "active", "exit_ip": _tor_exit_ip}
    # ... torsocks curl ...
    _tor_exit_ip = exit_ip
    _tor_exit_ts = time.time()
    return {"status": "active", "exit_ip": exit_ip}
Disk polling dropped from 20 seconds to under 100ms. Tor connection reuse cut daily Tor traffic by 90%.
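Both caches have the same shape, so the pattern could be factored into a small TTL decorator. Here is a sketch of that refactor, assuming the cached functions take no arguments; the dashboard itself keeps the two explicit module-level caches shown above.
import functools
import subprocess
import time


def ttl_cache(seconds: float):
    # Cache a zero-argument function's result for `seconds` seconds.
    def decorator(fn):
        cached = None
        cached_at = 0.0

        @functools.wraps(fn)
        def wrapper():
            nonlocal cached, cached_at
            if cached is not None and time.time() - cached_at < seconds:
                return cached
            cached = fn()
            cached_at = time.time()
            return cached

        return wrapper

    return decorator


@ttl_cache(60)
def parse_disk() -> dict:
    raw = subprocess.run(["du", "-sb", "/ai/models"],
                         capture_output=True, text=True).stdout.strip()
    return {"size": raw}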
ProtectSystem Blocking AIDE Updates
The systemd unit for the dashboard service had ProtectSystem=strict, which made the entire filesystem read-only for child processes. Even though sudo /data/scripts/aide-resolve.sh worked in a terminal, it failed inside the service with EROFS: Read-only file system when writing /var/lib/aide/aide.db.new.
# Symptom: fails with EROFS even with sudo
sudo /data/scripts/aide-resolve.sh
# EROFS: Read-only file system
The fix was to explicitly allow the directories AIDE needs.
# /etc/systemd/system/grid-dashboard.service
[Service]
ReadWritePaths=/data /var/log /var/lib/tor /var/lib/aide
After reloading systemd and restarting the service, AIDE updates worked again.
sudo systemctl daemon-reload
sudo systemctl restart grid-dashboard
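One lesson from this bug is that the failure only surfaced when the script actually tried to write. A hypothetical pre-flight check in the endpoint can turn a sandboxing regression into a clear HTTP error instead of a shell-script failure; ensure_writable and the 503 response are illustrative choices, not the dashboard's actual code.
import os

from fastapi import HTTPException

AIDE_STATE_DIR = "/var/lib/aide"  # one of the paths opened up via ReadWritePaths


def ensure_writable(path: str) -> None:
    # hypothetical helper: fail fast with a clear message if systemd
    # sandboxing (ProtectSystem=strict without ReadWritePaths) left
    # the path read-only for this service
    if not os.access(path, os.W_OK):
        raise HTTPException(
            status_code=503,
            detail=f"{path} is not writable; check ReadWritePaths in the unit file",
        )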
Frontend Polling Stacking and Variable Shadowing
The dashboard frontend polled /api/status every 5 seconds. When the endpoint took longer than 5 seconds, requests stacked up, overwhelming the browser and API alike.
// Original: no guard, requests stack
setInterval(() => {
  fetch("/api/status").then(...);
}, 5000);
I added an in-flight guard using useRef(false) to prevent overlapping requests.
const fetchingRef = React.useRef(false);

const fetchStatus = async () => {
  if (fetchingRef.current) return;
  fetchingRef.current = true;
  try {
    const res = await fetch("/api/status");
    // ...
  } finally {
    fetchingRef.current = false;
  }
};
Also fixed a variable shadowing bug where sv in hs.filter(sv => sv.category === cat) shadowed sv = data?.services from the outer scope. Renamed the inner variable to svc.
What I Actually Use
- DGX Spark: runs the FastAPI dashboard and all AI services locally
- Mistral Small 4: the language model powering the dashboard’s AI features
- systemd: manages services with precise resource controls
Load-Induced Bugs: four issues exposed under stress and their fixes.