Four Bugs That Only Showed Up Under Load: Fixing a FastAPI Dashboard
Quick Take
- Four separate bugs surfaced only under load in a FastAPI dashboard
- Each failure cost 5-20 seconds of blocked I/O and CPU time
- All four fixes are in production now and cut API latency from 22 s to 1.4 s
The dashboard looked fast in the browser until ten users hit it at once. Then everything froze. Not because the code was wrong, but because it was written like synchronous shell scripts inside async endpoints. Four independent failures compounded into a single 22-second response time. Here’s exactly what broke, why it broke, and how to fix it.
Event-Loop Blocking in get_status()
The /api/status endpoint is async, but every status check runs synchronously with subprocess.run(). That blocks the uvicorn event loop for the entire duration of the slowest check. Last week this failed because a disk scan on a 66 GB model directory took 20 seconds, and no other request could be processed during that time.
subprocess.run() is synchronous by design. When you call it inside an async function, Python hands the work to the OS and waits. The asyncio event loop can’t schedule another coroutine until the shell command finishes. That means every /api/status call freezes the entire dashboard until the disk scan, Docker list, and NVIDIA query all complete in sequence.
@app.get("/api/status")
async def get_status(_: bool = Depends(verify_token)):
    # These four commands block the event loop for 20+ seconds
    gpu = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    mem = subprocess.run(["free", "-h"], capture_output=True, text=True)
    disk = subprocess.run(["du", "-sb", "/ai/models"], capture_output=True, text=True)
    tor = subprocess.run(["torsocks", "curl", "https://check.torproject.org/api/ip"],
                         capture_output=True, text=True)
    return {"gpu": gpu.stdout, "mem": mem.stdout, "disk": disk.stdout, "tor": tor.stdout}
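The stall is easy to reproduce without a GPU or Docker. The sketch below is not from the dashboard code: `ticker` and `max_gap` are hypothetical names, and `time.sleep()` stands in for a slow `subprocess.run()`. It measures the longest pause a background coroutine sees while the "work" runs, first directly in the coroutine and then in the default thread pool.

```python
import asyncio
import time


async def ticker(ticks: list, stop: asyncio.Event) -> None:
    """Append a timestamp every 10 ms whenever the event loop lets us run."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def max_gap(blocking: bool, work_seconds: float = 0.3) -> float:
    """Return the longest pause between ticks while the 'work' runs."""
    ticks: list = []
    stop = asyncio.Event()
    task = asyncio.create_task(ticker(ticks, stop))
    await asyncio.sleep(0.05)  # let the ticker record a few samples

    if blocking:
        # Stand-in for subprocess.run() called directly inside a coroutine
        time.sleep(work_seconds)
    else:
        # Same work, handed to the default thread pool
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, time.sleep, work_seconds)

    await asyncio.sleep(0.05)
    stop.set()
    await task
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print(f"blocking:    {asyncio.run(max_gap(True)):.3f} s")
    print(f"in executor: {asyncio.run(max_gap(False)):.3f} s")
```

In the blocking case the ticker goes silent for about as long as the sleep lasts; in the executor case it should keep ticking at roughly its 10 ms cadence.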
The fix is to move all blocking I/O into the thread pool with loop.run_in_executor() and run them in parallel. The event loop stays free to handle other requests while the shell commands execute.
@app.get("/api/status")
async def get_status(_: bool = Depends(verify_token)):
    loop = asyncio.get_running_loop()
    gpu, mem, disk, tor = await asyncio.gather(
        loop.run_in_executor(None, lambda: subprocess.run(["nvidia-smi"],
                             capture_output=True, text=True)),
        loop.run_in_executor(None, lambda: subprocess.run(["free", "-h"],
                             capture_output=True, text=True)),
        loop.run_in_executor(None, lambda: subprocess.run(["du", "-sb", "/ai/models"],
                             capture_output=True, text=True)),
        loop.run_in_executor(None, lambda: subprocess.run(["torsocks", "curl",
                             "https://check.torproject.org/api/ip"],
                             capture_output=True, text=True))
    )
    return {"gpu": gpu.stdout, "mem": mem.stdout, "disk": disk.stdout, "tor": tor.stdout}
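Now the response time equals the slowest check instead of the sum of all checks. That still leaves the 20-second disk scan as the floor until its result is cached, which is covered in the caching section. With every fix in this post in production, the endpoint dropped from 22 s to 1.4 s under load.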
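The four near-identical lambdas can be folded into one helper. This is a sketch rather than the dashboard's actual code: `run_cmd` and `run_checks` are hypothetical names, and the `timeout` argument is an addition of mine so that one hung command cannot hold a worker thread forever.

```python
import asyncio
import subprocess
import sys


def run_cmd(cmd: list[str], timeout: float = 30.0) -> str:
    """Run one command synchronously; meant to be called from a worker thread."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError) as exc:
        # Missing binaries and timeouts become a string instead of a 500 error
        return f"error: {exc}"
    return result.stdout


async def run_checks(commands: dict[str, list[str]], timeout: float = 30.0) -> dict[str, str]:
    """Run every command in the default thread pool and collect stdout by name."""
    loop = asyncio.get_running_loop()
    outputs = await asyncio.gather(
        *(loop.run_in_executor(None, run_cmd, cmd, timeout) for cmd in commands.values())
    )
    # gather() preserves argument order, so names and outputs line up
    return dict(zip(commands, outputs))


if __name__ == "__main__":
    # Placeholder commands that run anywhere Python does
    demo = {
        "hello": [sys.executable, "-c", "print('hello')"],
        "version": [sys.executable, "--version"],
    }
    print(asyncio.run(run_checks(demo)))
```

In the endpoint, the dictionary would hold the `nvidia-smi`, `free`, `du`, and `torsocks` commands from the listing above, keyed by the names used in the response.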
N+1 Docker Calls in check_containers()
The original check_containers() function ran one docker inspect per container to read the Watchtower label. Ten containers meant eleven separate subprocess calls. Last week this failed because the dashboard started timing out when the container count crossed eight.
In docker ps, the .Labels field is rendered as a single comma-separated string rather than a map, so indexing it with the Go template index function fails:
docker ps -a --format '{{index .Labels "com.centurylinklabs.watchtower.enable"}}'
# "can't index slice/array with type string"
The fix is to split the work into two calls: one lightweight list of all containers, and a filtered list of only the Watchtower-enabled ones.
def check_containers():
    # Call 1: fast list of every container, one JSON object per line
    raw = subprocess.run(["docker", "ps", "-a",
                          "--format", '{"name":"{{.Names}}","status":"{{.Status}}"}'],
                         capture_output=True, text=True).stdout
    # docker prints one line per container, so parse line by line
    containers = [json.loads(line) for line in raw.splitlines() if line.strip()]
    # Call 2: only containers with the label
    wt_raw = subprocess.run(["docker", "ps", "-a",
                             "--filter", "label=com.centurylinklabs.watchtower.enable=true",
                             "--format", "{{.Names}}"],
                            capture_output=True, text=True).stdout
    wt_names = set(wt_raw.split())  # empty output yields an empty set
    for c in containers:
        c["watchtower"] = c["name"] in wt_names
    return containers
This reduces the Docker overhead from O(n) subprocess calls to two calls regardless of container count.
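The merge step can be tested without a Docker daemon by separating it from the subprocess calls. The sketch below assumes the first call prints one JSON object per line with `name` and `status` keys, and the second prints one container name per line. `merge_containers` is a hypothetical helper, and the sample strings are illustrative, not captured from a real host.

```python
import json


def merge_containers(ps_output: str, watchtower_output: str) -> list[dict]:
    """Combine the two `docker ps` outputs into one list of container dicts."""
    containers = [json.loads(line) for line in ps_output.splitlines() if line.strip()]
    wt_names = {name.strip() for name in watchtower_output.splitlines() if name.strip()}
    for container in containers:
        container["watchtower"] = container["name"] in wt_names
    return containers


# Illustrative sample output, not captured from a real host
sample_ps = (
    '{"name":"grafana","status":"Up 3 hours"}\n'
    '{"name":"ollama","status":"Exited (0) 2 days ago"}\n'
)
sample_wt = "grafana\n"

result = merge_containers(sample_ps, sample_wt)
for entry in result:
    print(entry["name"], entry["watchtower"])
```

Keeping the parsing pure also means an empty `docker ps` result produces an empty list instead of a JSON decode error.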
Deprecated asyncio.get_event_loop() in aide_resolve()
Since Python 3.10, asyncio.get_event_loop() emits a DeprecationWarning when it is called without a running event loop, and the asyncio documentation recommends asyncio.get_running_loop() inside coroutines. The aide_resolve endpoint used get_event_loop() to spawn a subprocess, which broke under systemd because the event loop wasn’t accessible.
# BROKEN: get_event_loop() is discouraged inside coroutines
@app.post("/api/aide/resolve")
async def aide_resolve():
    loop = asyncio.get_event_loop()  # may warn or guess the wrong loop
    await loop.run_in_executor(None, update_aide_db)
The fix is to use asyncio.get_running_loop(), which always returns the loop that is executing the current coroutine and raises RuntimeError if there is none.
@app.post("/api/aide/resolve")
async def aide_resolve():
    loop = asyncio.get_running_loop()  # the loop running this coroutine
    await loop.run_in_executor(None, update_aide_db)
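The difference in behavior is easy to check in isolation. In this minimal sketch, `outside` and `inside` are hypothetical names and `sum` stands in for `update_aide_db`: `get_running_loop()` fails loudly when no loop is running and returns the active loop inside a coroutine.

```python
import asyncio


def outside() -> str:
    """Call get_running_loop() from plain synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return "no running loop"
    return "unexpected"


async def inside() -> int:
    """Call get_running_loop() from a coroutine and use it to offload work."""
    loop = asyncio.get_running_loop()
    # sum() is a placeholder for a blocking function such as update_aide_db
    return await loop.run_in_executor(None, sum, [1, 2, 3])


if __name__ == "__main__":
    print(outside())
    print(asyncio.run(inside()))
```

The explicit RuntimeError is the useful part: a call from the wrong context fails immediately instead of silently picking up or creating a different loop.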
Missing Disk and Tor Exit-IP Caches
The original code ran du -sb /ai/models and a new Tor connection on every 5-second poll. A 66 GB directory scan can take 20 seconds, and a fresh Tor circuit can take 5 seconds. Because asyncio.gather() waits for all checks, the slowest one still defined the response time.
def parse_disk():
    # Re-scans 66 GB every 5 seconds
    return subprocess.run(["du", "-sb", "/ai/models"], capture_output=True, text=True).stdout

def check_tor():
    # Re-builds Tor circuit every 5 seconds
    return subprocess.run(["torsocks", "curl", "https://check.torproject.org/api/ip"],
                          capture_output=True, text=True).stdout
The solution is to cache the results with a time-to-live. Disk size changes rarely, and the Tor exit IP changes roughly every 10 minutes, so a 60-second disk cache and a 120-second Tor cache are enough.
_disk_cache: str | None = None
_disk_cache_ts: float = 0
_tor_exit_ip: str | None = None
_tor_exit_ts: float = 0

def parse_disk() -> str:
    global _disk_cache, _disk_cache_ts
    if time.time() - _disk_cache_ts < 60 and _disk_cache is not None:
        return _disk_cache
    _disk_cache = subprocess.run(["du", "-sb", "/ai/models"],
                                 capture_output=True, text=True).stdout
    _disk_cache_ts = time.time()
    return _disk_cache

def check_tor() -> str:
    global _tor_exit_ip, _tor_exit_ts
    if time.time() - _tor_exit_ts < 120 and _tor_exit_ip is not None:
        return _tor_exit_ip
    _tor_exit_ip = subprocess.run(["torsocks", "curl", "https://check.torproject.org/api/ip"],
                                  capture_output=True, text=True).stdout
    _tor_exit_ts = time.time()
    return _tor_exit_ip
With these caches the disk and Tor checks take microseconds instead of seconds, and the dashboard stays responsive.
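Both caches follow the same pattern, so they could share one decorator. The sketch below is an alternative to the module-level globals, not what the dashboard runs: `ttl_cache` is a hypothetical name, it uses `time.monotonic()` so clock adjustments cannot extend or shorten the TTL, and it holds a lock during the refresh so two worker threads never start the same slow scan at once.

```python
import threading
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def ttl_cache(seconds: float) -> Callable[[Callable[[], T]], Callable[[], T]]:
    """Cache the result of a zero-argument function for `seconds`."""
    def decorator(func: Callable[[], T]) -> Callable[[], T]:
        lock = threading.Lock()
        has_value = False
        value = None
        expires_at = 0.0

        def wrapper() -> T:
            nonlocal has_value, value, expires_at
            # Concurrent callers wait here instead of starting a second refresh
            with lock:
                if has_value and time.monotonic() < expires_at:
                    return value
                value = func()
                expires_at = time.monotonic() + seconds
                has_value = True
                return value

        return wrapper
    return decorator


if __name__ == "__main__":
    @ttl_cache(60)
    def slow_value() -> float:
        time.sleep(0.2)  # stand-in for a slow `du` scan
        return time.monotonic()

    first = slow_value()
    second = slow_value()  # served from the cache, no sleep
    print(first == second)
```

With this in place, `parse_disk` and `check_tor` would each shrink to the subprocess call plus a `@ttl_cache(60)` or `@ttl_cache(120)` line.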
ProtectSystem=strict Blocking AIDE --update
The systemd unit for the dashboard used ProtectSystem=strict, which mounts the root filesystem as read-only for all child processes. That blocked sudo /data/scripts/aide-resolve.sh from writing /var/lib/aide/aide.db.new, even though the script ran as root.
# /etc/systemd/system/grid-dashboard.service
[Service]
ProtectSystem=strict
The symptom is a clear EROFS error when the script tries to update the AIDE database.
sudo /data/scripts/aide-resolve.sh
# EROFS: Read-only file system: '/var/lib/aide/aide.db.new'
The fix is to whitelist the directories AIDE needs to write.
# /etc/systemd/system/grid-dashboard.service
[Service]
ProtectSystem=strict
ReadWritePaths=/data /var/log /var/lib/tor /var/lib/aide
After reloading systemd and restarting the service, AIDE updates work again.
sudo systemctl daemon-reload
sudo systemctl restart grid-dashboard
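A startup check can surface this class of problem before a user clicks the button. The sketch below is an assumption about how one might verify the sandbox from inside the service, not part of the original dashboard. `check_writable` is a hypothetical helper that tries to create a throwaway file in each directory and reports the error if it cannot.

```python
import tempfile
from typing import Optional


def check_writable(paths: list[str]) -> dict[str, Optional[str]]:
    """Map each directory to None if writable, else to a short error string."""
    results: dict[str, Optional[str]] = {}
    for path in paths:
        try:
            # The file is deleted automatically when the context exits
            with tempfile.TemporaryFile(dir=path):
                pass
            results[path] = None
        except OSError as exc:
            # Covers read-only mounts, permission errors, and missing directories
            results[path] = f"{type(exc).__name__}: {exc}"
    return results


# Example call with the directories from the unit file above:
# check_writable(["/data", "/var/log", "/var/lib/tor", "/var/lib/aide"])
```

Logging the result once at startup would make a missing ReadWritePaths entry visible in the journal immediately after a deploy.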
Frontend Poll Stacking and Variable Shadowing
When /api/status took longer than the 5-second frontend poll, the browser stacked requests. Each new poll fired before the previous one finished, creating a snowball of overlapping fetches.
// Original code: no guard against in-flight requests
const fetchStatus = async () => {
  const res = await fetch("/api/status");
  setData(await res.json());
};
setInterval(fetchStatus, 5000);
The fix is to use a ref as a mutex so only one poll runs at a time.
const fetchingRef = { current: false };
const fetchStatus = async () => {
  if (fetchingRef.current) return;
  fetchingRef.current = true;
  try {
    const res = await fetch("/api/status");
    setData(await res.json());
  } finally {
    fetchingRef.current = false;
  }
};
setInterval(fetchStatus, 5000);
A second bug was variable shadowing in the service list filter. The outer scope used sv for the services array, and the inner arrow function redefined sv for each service, breaking the filter.
// BROKEN: shadowing sv
const filtered = data?.services.filter(sv => sv.category === cat);
Renaming the inner variable fixed the filter.
// FIXED: renamed to svc
const filtered = data?.services.filter(svc => svc.category === cat);
What I Actually Use
- FastAPI 0.111 with async endpoints and loop.run_in_executor
- systemd units with ProtectSystem=strict plus explicit ReadWritePaths
- React frontend with useRef guards to prevent poll stacking