Upstream

Bugs found while running this stack get filed and patched at the source. Tools extracted from this stack get released under permissive licenses. Both halves are listed below (releases first, then contributions to other projects), with links to code, upstream issues/PRs, and the blog posts that document the diagnosis. Identity for all of these: github.com/cipherfoxie.

5 released open-source

24 contributions

3 merged

14 open

19 projects

Snapshot from GitHub API on 2026-06-20. Refreshed each deploy.

Released open-source

The other half of upstream: tools and servers extracted from this stack and released under permissive licenses. 5 so far.

agent-bench

benchmark · MIT-licensed · dependency-free Node 22 · deterministic gates · write-up

An A/B harness that measures whether an MCP server or skill actually improves a coding agent: baseline arm vs treatment arm on your own models, scored by build gates, typechecks, and frozen fact checklists instead of LLM judges. Negative results are published by design. First case studies: Serena (a guardrail, not a turbo) and the caveman skill (claim 75%, measured 31%, never cheaper in dollars).
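A minimal sketch of what scoring by deterministic gates can look like, written in Python for illustration (agent-bench itself is Node 22). The names `Gate`, `ArmResult`, `score_arm`, and `compare` are hypothetical and are not taken from the agent-bench code; the point is only that every check is a plain pass/fail function, so the same outputs always produce the same score.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical gate type: a deterministic pass/fail check on one run's output.
Gate = Callable[[str], bool]


@dataclass
class ArmResult:
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def score_arm(name: str, outputs: List[str], gates: Dict[str, Gate]) -> ArmResult:
    """Count gate passes over all runs of one arm; no LLM judge involved."""
    passed = sum(1 for out in outputs for gate in gates.values() if gate(out))
    return ArmResult(name, passed, len(outputs) * len(gates))


def compare(baseline: ArmResult, treatment: ArmResult) -> float:
    """Pass-rate delta; a negative value means the treatment made things worse."""
    return treatment.pass_rate - baseline.pass_rate
```

A negative or zero delta is a valid result under this scheme, which is what makes publishing negative results straightforward.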

sovereign-mcp

mcp-server · MIT-licensed · FastMCP 1.27.0 · Python 3.12 · write-up

A Streamable-HTTP MCP server that exposes the engineering log as four structured tools (search_blog, get_article, list_tags, diagnose_sglang). Listed on the official MCP registry as org.sovgrid/self-hosted-ai with DNS-auth (ed25519), Smithery 100/100, Glama, and the canonical awesome-mcp-servers index. Reference deployment: mcp.sovgrid.org/self-hosted-ai.

vps-healthcheck

ops · MIT-licensed · ~250 LOC bash · no daemon · write-up

Daily SSH-based VPS audit: twelve checks bundled into one SSH session, single notification push on regression, exit 0/1/2 for cron. The README walks through why this fits the one-or-two-VPS operator who does not want to monitor the monitoring (versus Prometheus / Netdata / Uptime Kuma).
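The source states only that the tool exits 0/1/2 for cron. The sketch below shows one way to collapse many check results into a single exit code; the meaning given to each code (0 = all checks pass, 1 = at least one regression, 2 = the audit itself could not run) is an assumption for illustration, and the real tool is bash, not Python.

```python
from enum import IntEnum
from typing import Iterable


class Status(IntEnum):
    # Assumed meanings; the README is the authority on the real mapping.
    OK = 0
    REGRESSION = 1
    ERROR = 2


def overall_exit_code(results: Iterable[Status]) -> int:
    """The worst individual status wins; an empty run counts as OK."""
    return int(max(results, default=Status.OK))
```

With this shape, cron only needs the single exit code to decide whether a notification is warranted.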

watchdocker

devops · MIT-licensed · ~350 LOC bash · systemd-timer, no daemon · write-up

A bash-native successor to the archived watchtower: a systemd timer pulls and smart-restarts Docker Compose stacks once a week, then gets out of the way, with no 24/7 container and no open socket. Built after watchtower 1.7.1 broke against Docker Engine 29.5. The write-up compares it honestly against the surviving forks (nicholas-fedor/watchtower, WUD) and is explicit about which operator each one fits.

sovereign-backup

backup · MIT-licensed · pure bash · age-encrypted · systemd timer · write-up

A config-driven, multi-host backup tool: tar plus age encryption to a USB stick or local NVMe, driven by per-host YAML, with a systemd timer and a dry-run-smoke-tested CLI. One generic binary across hosts; the real source list and destinations live in config, not in the script.
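To illustrate the "generic binary, config holds the specifics" idea, here is a small Python sketch that turns a per-host config into a tar-plus-age pipeline string, the kind of thing a dry run could print. The config keys (`sources`, `age_recipient`, `destination`) and the function name are hypothetical; the actual tool is pure bash and its YAML schema is not shown in the source.

```python
import shlex
from typing import Dict


def build_backup_command(host_cfg: Dict[str, object]) -> str:
    """Build (not run) a `tar | age` pipeline from a per-host config dict."""
    # Quote every value so paths with spaces survive the shell.
    sources = [shlex.quote(str(p)) for p in host_cfg["sources"]]
    recipient = shlex.quote(str(host_cfg["age_recipient"]))
    dest = shlex.quote(str(host_cfg["destination"]))
    tar_cmd = "tar -czf - " + " ".join(sources)  # archive to stdout
    age_cmd = f"age -r {recipient} -o {dest}"    # encrypt stdin to dest
    return f"{tar_cmd} | {age_cmd}"
```

Because the function only builds a string, the same code path serves both the dry run and the real run; nothing host-specific lives in the script.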

Contributions to other projects

PR merged spark-arena/community-recipe-registry #6 2026-06-12

Add qwen3.6-35b-a3b AutoRound int4-mixed + DFlash recipe (69.2 tok/s measured, production config)

Submitted the production Qwen3.6-35B AutoRound int4-mixed + DFlash k=3 config as a community recipe: 69.2 tok/s single-stream decode (prefill-separated, cross-checked against vLLM's TPOT metric), 18/18 on a deterministic coding gate, vision tower available behind an extra_flags toggle. Merged same day. Anyone with a DGX Spark can now run the exact daily-driver config behind this blog with one sparkrun command.

Read the full diagnosis on the blog →

PR merged getAlby/bitcoin-connect #385 2026-05-05

docs: clear window.webln on disconnect (closes #215)

Issue #215 (filed 2024-05-06) flagged that the README's documented pattern for assigning the provider to window.webln in onConnected leaves a stale provider on the global after disconnect: the saved reference outlives the wallet connection. The library can't own window.webln (that would conflict with consumer expectations and other extensions), so the right fix is documentation. The PR extends the WebLN global object example to pair onConnected with onDisconnected and delete the global. Merged 2026-05-07.

Read the full diagnosis on the blog →

PR merged punkpeye/awesome-mcp-servers #5645 2026-04-30

Add cipherfoxie/sovereign-mcp to Knowledge & Memory 🤖🤖🤖

Listed the Sovereign AI MCP under Knowledge & Memory in the canonical awesome-mcp-servers index. Part of a four-registry distribution push (Smithery, Glama Connector, Glama Server, awesome-mcp) executed in a single afternoon. Merged within hours of submission.

Read the full diagnosis on the blog →

Issue open intel/auto-round #1919 2026-06-12

Production report: Qwen3.6-35B-A3B int4-mixed replaced a 4.75-bit build on DGX Spark (GB10), +12.7% decode, no quality loss, vision intact

Production field report to the quant authors: their int4-mixed build of Qwen3.6-35B replaced a 4.75-bit compressed-tensors build as this stack's resident model. +12.7% decode under the same ruler, identical 18/18 on the deterministic coding gate, and of the three quants compared it was the only one that kept a working vision tower. Most quant feedback is benchmarks; this is a deployment decision with the receipts.

Read the full diagnosis on the blog →

Discussion open oraios/serena #1573 2026-06-12

Measured Serena on two self-hosted models with deterministic gates: no effect on a strong model, but it rescued a weak one on the dangerous task

Shared the deterministic-gate measurement with the maintainers: Serena rescued a weak model on the dangerous ambiguous-rename task (0/3 to 1/3 correct, collateral damage down from 8 files to 1.7) while a strong model never needed it and paid +15-158% input tokens in tool-schema overhead. Net: a semantic guardrail for weak models, a tax on strong ones, and a concrete optimization target (schema size) for the project.

Read the full diagnosis on the blog →

Discussion open JuliusBrussee/caveman #520 2026-06-12

Independent measurement across 5 models: output reduction is real (~-31%) but total cost never dropped; the 65-75% claim does not reproduce

Took the 65-75% token-reduction claim of a ~200k-install skill to five models with deterministic gates. What reproduces: ~-31% output tokens on chat answers. What does not: the headline claim, on any model; the instruction rides as ~1k input tokens per request, so total cost never dropped, and one model's outputs got 18% longer. Suggested an honest re-scoping of the claim rather than its removal.

Read the full diagnosis on the blog →

Issue open z-lab/dflash #135 2026-06-12

Docs suggestion: cold-start TPOT undersells steady-state by ~35% (43 vs 69 tok/s) until the draft path warms

Docs suggestion from a measurement trap caught in production: DFlash's speculative-decoding acceptance rate needs a real request window to warm up, so a cold TPOT reading right after launch undersells steady-state by ~35% (43 vs 69 tok/s on the same unchanged setup). Exactly the kind of number that fuels wrong 'X is slow' reports. Suggested one README line: benchmark after warmup.

Read the full diagnosis on the blog →
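The "benchmark after warmup" advice can be sketched as a small helper that discards the first few time-per-output-token (TPOT) samples before converting to tokens per second. The function names and the `warmup` parameter are hypothetical, and how many samples count as warmup is an assumption that has to be tuned per setup.

```python
from statistics import mean
from typing import Sequence


def tokens_per_second(tpot_seconds: float) -> float:
    """Convert time-per-output-token (seconds) to decode throughput."""
    return 1.0 / tpot_seconds


def steady_state_tok_s(tpot_samples: Sequence[float], warmup: int) -> float:
    """Throughput over the samples that remain after dropping the warmup window."""
    steady = tpot_samples[warmup:]
    if not steady:
        raise ValueError("no samples left after discarding warmup")
    return tokens_per_second(mean(steady))
```

Calling the helper with `warmup=0` on a run that includes cold samples gives a lower number than the same run with the cold samples dropped, which is the trap the issue describes.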

Issue open spark-arena/recipe-registry #17 2026-06-12

Independent reproduction: gpt-oss-120b single-Spark 58.82 tok/s confirmed (59.5 measured), plus two notes on the nightly image

Independent reproduction of the verified gpt-oss-120b single-Spark leaderboard entry: 59.5 tok/s measured against their published 58.82, with a GPU-clock-lock check confirming the number is a memory-bandwidth ceiling rather than a tuning artifact. Plus two documented traps in the nightly-image path (44-minute compile hang that freezes unified-memory boxes, MARLIN MXFP4 ballooning past the memory cap) and a request to publish the prebuilt CUTLASS image.

Read the full diagnosis on the blog →

Issue open spark-arena/dgx-vllm #3 2026-06-10

Measured +1 for the --exp-mxfp4 prebuilt image request (comment)

Issue open vllm-project/vllm #43969 2026-05-29

gpt-oss-120b MXFP4 on GB10: MARLIN allocates past gpu-memory-utilization; CUTLASS source build resolves it (diagnostic comment)

Added three diagnostic data points to the existing unified-memory ARM OOM report: the MARLIN MXFP4 MoE path reaching ~118 GB despite a 0.7 gpu-memory-utilization cap, the FlashInfer MXFP4 kernel rejecting SM121A outright, and a source-built CUTLASS + FlashInfer image (arch 12.1a) that respects the cap at ~61 GB and serves at 59.5 tok/s. Narrows the bug to memory accounting in the MARLIN fallback rather than weights or hardware.

Read the full diagnosis on the blog →

PR open vite-pwa/docs #192 2026-05-05

docs(deployment): fix grammar in AWS Amplify WIP placeholder

Re-submission of vite-plugin-pwa #930 after maintainer @userquin closed the original (wrong repo: docs moved to this dedicated repository). The same 'Will coming soon' typo appears in deployment/aws.md; 'will' takes a bare infinitive, not a present participle. One-line fix to 'Coming soon'. Three-step engagement pattern: (1) initial PR in the wrong repo, (2) maintainer redirects, (3) re-submit cleanly and offer follow-up cleanup of orphan placeholder files in the legacy repo.

Issue open OpenHands/OpenHands #14287 2026-05-04

[Bug]: RecallAction with EventSource.USER after a user message produces two consecutive USER turns, breaks Mistral / SGLang strict-alternation backends

AgentController._handle_message_action dispatches a RecallAction with EventSource.USER after every user message. The recall lands as a synthetic USER turn, so the LLM request carries two consecutive user messages and any Mistral-template-strict backend rejects it with HTTP 400 on prompt 1 of every session. Three suggested fixes plus the controller-patch workaround that operators run today.

Read the full diagnosis on the blog →
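For readers unfamiliar with the strict-alternation constraint, the sketch below shows one generic way to normalize a message list so that no two adjacent turns share a role. It is an illustration only: the function name is hypothetical, and it is not claimed to be any of the three fixes suggested in the issue or the controller patch operators run.

```python
from typing import Dict, List

Message = Dict[str, str]


def merge_consecutive_roles(messages: List[Message], sep: str = "\n\n") -> List[Message]:
    """Collapse adjacent messages with the same role into a single turn."""
    merged: List[Message] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same role twice in a row: append content to the previous turn.
            merged[-1] = {
                "role": msg["role"],
                "content": merged[-1]["content"] + sep + msg["content"],
            }
        else:
            merged.append(dict(msg))  # copy so the input list is not mutated
    return merged
```

Applied to a request where a recall lands as a second USER turn, the two user messages become one and a strict template sees a valid user/assistant alternation.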

Issue open mistralai/mistral-vibe #667 2026-05-04

[Bug]: Context-window overflow during write_file silently drops file content, agent reports 'committed' on a corrupted file

Under context pressure the model emits a partial-rewrite that drops chunks of the original file, but the tool-call success shape is unchanged so the write commits silently. Suggested handling: pre-commit size-delta check, an apply_patch primitive bounded to the section being edited, and a context-overflow flag that an MCP can refuse on.

Read the full diagnosis on the blog →
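The pre-commit size-delta check suggested above could look roughly like this. The function name and the 50% default threshold are assumptions made for the sketch; a real threshold would need tuning, since legitimate refactors also shrink files.

```python
def size_delta_ok(old: str, new: str, max_shrink: float = 0.5) -> bool:
    """Return False when a rewrite shrinks the file by more than max_shrink."""
    if not old:
        return True  # new file or previously empty file: nothing to lose
    shrink = (len(old) - len(new)) / len(old)
    return shrink <= max_shrink
```

A write tool could call this before committing and surface a failure to the agent instead of reporting success on a truncated file.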

PR open mistralai/mistral-vibe #666 2026-05-04

fix(generic): default delta.role to 'assistant' for streaming chunks

OpenAIAdapter._parse_message rejected streaming chunks 2+ because delta.role is null per OpenAI spec but LLMMessage.role is a required enum without default. TUI failed on the first user prompt of every session. Three-line defensive default, plus a suggestion for a cleaner upstream form (Optional role with default).

Read the full diagnosis on the blog →
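The shape of the defensive default can be sketched with stand-in types. `Role`, `Delta`, and `parse_delta` below are hypothetical and are not mistral-vibe's actual classes; the sketch only shows a missing or null role on a streaming chunk falling back to 'assistant' instead of failing validation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class Role(str, Enum):
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"


@dataclass
class Delta:
    role: Role      # required, like the enum field described above
    content: str


def parse_delta(chunk_delta: Dict[str, Any]) -> Delta:
    """Parse one streaming delta; chunks after the first may carry no role."""
    role = chunk_delta.get("role") or Role.ASSISTANT.value
    return Delta(role=Role(role), content=chunk_delta.get("content") or "")
```

An unknown role string still raises, so the default only covers the missing-role case rather than hiding malformed chunks.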

Issue open mistralai/mistral-vibe #665 2026-05-04

OpenAIAdapter._parse_message rejects streaming chunks 2+ when delta.role is null (TUI fails on prompt 1)

Reproduction with raw SGLang chunk dump, two falsified hypotheses (a 2.9.3 streaming-parser regression and a Pydantic 2.13 enum-strictness change) before the actual root cause was traced to _parse_message.

Read the full diagnosis on the blog →

Issue open vllm-project/vllm #37431 2026-03-18

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1

Cross-pollination comment from adjacent inference engines on the same DGX Spark / GB10 hardware: SGLang nightly with Mistral Small 4 NVFP4 + vLLM for Voxtral. Pointed maintainers to sibling sgl-project/sglang#21085 and the two open SGLang PRs attempting fixes (#21558, #24269). Reinforced @EmilHaase's finding that native causal-conv1d works on SM121 but Triton-generated equivalents fail — the bug is in the Triton-codegen-for-SM121 path specifically, not the substrate. Offered to test the CUDA_LAUNCH_BLOCKING=1 + --enforce-eager workaround on this Spark and cross-check whether vLLM's Triton SSM ops fail the same way SGLang's flashinfer-GDN path does. Asked for a minimal Mamba-2 repro that avoids the 120B Nemotron-3 weights.

Issue open gpustack/gpustack #3456 2025-11-23

`hf download` exits 0 while leaving `.incomplete` blobs after xet 416 CAS errors

`hf download` exits with status 0 while leaving `.incomplete` blobs in the cache when xet 416 CAS errors occur on flaky upstream connectivity. Distinct from #3960 (process killed): in our case the exit code lies, only filesystem state is the ground truth. Repro covers DGX Spark (GB10 Blackwell SM12.1, ARM v9.2-A) on three large models (Voxtral-4B-TTS-2603, VibeVoice-Large, Qwen3.6-35B-A3B-PrismaQuant). Three independently-effective workarounds documented (HF_HUB_DISABLE_XET=1, HF_HUB_DOWNLOAD_TIMEOUT=300, filesystem .incomplete check), plus the 347-LOC hf-pull wrapper combining all three with exponential-backoff retries. Resolved and closed 2026-06-11: maintainer @Wauplin reproduced the exit-0-on-truncation deterministically and confirmed the exit-code-vs-filesystem-state mismatch was the real bug. Root cause in hf_xet 1.4.2: a spurious within-file HTTP 416 was treated as end-of-file, and the partial byte count was returned as success without comparing it to the expected size. Fixed in hf_xet >= 1.5.1 via xet-core #716 (bounds the reconstruction range to the known file size) and #735 (adds a downloaded-vs-expected size check that turns a short read into a hard error instead of a silent exit 0); huggingface_hub #4306 further reworks the cache write path so partial files can never be committed in place.

Read the full diagnosis on the blog →
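The filesystem check mentioned among the workarounds can be approximated in a few lines. This is a sketch, not the hf-pull wrapper: the function names are hypothetical, and it assumes the cache directory passed in is the one the download wrote to.

```python
from pathlib import Path
from typing import List


def find_incomplete_blobs(cache_dir: Path) -> List[Path]:
    """List leftover `.incomplete` files anywhere under the cache directory."""
    return sorted(p for p in cache_dir.rglob("*.incomplete") if p.is_file())


def download_really_succeeded(exit_code: int, cache_dir: Path) -> bool:
    """Trust the exit code only if the filesystem agrees with it."""
    return exit_code == 0 and not find_incomplete_blobs(cache_dir)
```

A retry loop can then key off `download_really_succeeded` rather than the exit code alone, which matters on hf_xet versions that predate the fix.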

PR closed gpustack/gpustack #5274 2026-05-07

fix: thread-safe tqdm progress hijack + cleanup hook on close (#3456)

Issue #3456 (open since 2025-11-23, no maintainer reply) reports that GPUStack's HuggingFace model-download progress shows wrong percentages until container restart. Reading model_file_manager.py revealed a cluster of seven open issues (#3456 plus #3547, #3833, #3853, #3882, #5253, #2734) all touching the same monkey-patched tqdm logic. Three compounding bugs identified: (A) thread-unsafe _handle_tqdm_init while HfDownloader runs thread_map(max_workers=8), (B) no tqdm.close hook, so _file_line_mapping accumulates stale entries (which is exactly what a container restart resets), and (C) a separate two-tqdm-instances-per-file pattern visible in the reporter's logs. PR scope: bugs A and B only (lock the init mutations + add a close hook). Iterated once after gemini-code-assist flagged that _assign_file_basename was still outside the lock; moved it inside in the second commit. Bug C (two tqdm instances per file) is left for a follow-up PR with a local reproducer (HF chunked-download interrupt mid-transfer). Engagement model: identified the cluster first to give the maintainer a one-PR-fixes-seven-issues hook rather than a single bug report.
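The two fixes in the PR's scope, locking the init-time mutations and adding a close hook, follow a common pattern that can be sketched independently of GPUStack. The class below is hypothetical and is not the code in model_file_manager.py; only the name `_file_line_mapping` is borrowed from the description above.

```python
import threading
from typing import Dict


class ProgressRegistry:
    """Maps each file being downloaded to a display line, safely across threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._file_line_mapping: Dict[str, int] = {}
        self._next_line = 0

    def on_init(self, filename: str) -> int:
        # All mutations happen under the lock, so concurrent workers
        # cannot hand out the same line twice.
        with self._lock:
            if filename not in self._file_line_mapping:
                self._file_line_mapping[filename] = self._next_line
                self._next_line += 1
            return self._file_line_mapping[filename]

    def on_close(self, filename: str) -> None:
        # Close hook: drop the entry so stale state does not accumulate.
        with self._lock:
            self._file_line_mapping.pop(filename, None)

    def active_count(self) -> int:
        with self._lock:
            return len(self._file_line_mapping)
```

Without the `on_close` step the mapping only ever grows, which matches the symptom that a restart was the only way to reset the progress display.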

PR closed vite-pwa/vite-plugin-pwa #930 2026-05-05

docs(astro): fix grammar in Astro WIP placeholder (closed: wrong repo)

Closed by maintainer @userquin: the framework docs pages in this repository are orphan placeholders, the canonical docs moved to vite-pwa/docs. PR re-submitted there as #192. While engaging on the close, I asked the maintainer whether the orphan files (astro.md + iles.md, both <12 lines) should be removed entirely from this repo to prevent search-engine landings on stale content. Engagement model worked even on a closed PR: same-day maintainer response, helpful redirect, opening for follow-up cleanup PR.

PR closed getAlby/bitcoin-connect #384 2026-05-05

fix: fire webln:enabled event on successful connect (closes #82)

Bitcoin Connect listens to the webln:enabled event in boot.ts to detect external WebLN providers, but never dispatches it itself when one of its own connectors finishes connecting. PR dispatched the event after a successful connect in store.ts and tightened the listener guard in boot.ts. Closed by maintainer rolznz on 2026-05-06 with a design clarification: BC deliberately does not own window.webln, consumers manage the provider locally via onConnected, so firing the global event would mislead spec-following callers that read window.webln directly. Useful outcome anyway — the closure surfaced the documented ownership model and led to companion docs PR #385.

Read the full diagnosis on the blog →

Issue closed openclaw/openclaw #77336 2026-05-04

[Feature Request]: Built-in handling of strict role alternation for Mistral / SGLang backends

OpenClaw's request shape produces consecutive user messages in normal multi-turn flows, which Mistral's strict-alternation chat template rejects with HTTP 400. Filed as a feature request with two suggested in-repo approaches (provider capability flag, or pre-send hook on the Mistral plugin). Closed not-planned on 2026-06-01: migrating my own stack to Qwen removed the strict-alternation constraint, so I am no longer blocked and won't be pursuing the PR. The side-car proxy remains a documented workaround for anyone still serving Mistral via SGLang.

Read the full diagnosis on the blog →

Why this page exists

Every fix that ships in production here started as a bug somewhere upstream. Sending the fix back is the honest move and produces a more durable record than a private patch. This page is also the consolidated public-facing index of contributions, easier than scrolling GitHub profile activity.

If a contribution is open and you have an interest in it landing, react on the GitHub issue or PR directly. The blog posts give the engineering context; the upstream tracker is where the merge decision happens.

Open issues (local, not yet upstream)

Issues found while running this stack that have not yet been filed upstream or that are local-design problems with no obvious upstream target. Listed here so the gap is visible, not hidden. The same Engineering Honesty Rule 6 commitment as the manifesto: publish corrections, not just the original claim.

flashinfer on GB10. OOM on first batch. Workaround: --attention-backend triton. SM121A is not supported in the flashinfer build that ships with the current nightly tag. Upstream issue not yet filed; the workaround is stable enough that the priority has stayed low.
Sequential-only GPU services. The LLM, TTS, and image services cannot share the 128 GB unified pool. Coordination via the CLI mutex (switch.sh qwen|mistral|none|status) with a 60-second guard between transitions, runnable from Termux. The manual context-switching cost for the operator has not been fully eliminated.
OpenClaw streaming watchdog reset. When switching to a new Anthropic model mid-session, the first stream sometimes drops at the 30-second watchdog. Workaround: send a new message; the next stream resyncs.
Hero image diversity. The image model defaults to recurring visual metaphors across articles. The pipeline now uses per-style visual vocabularies and a rolling motif blacklist to break the pattern. Improvement is ongoing; the 32-article drop on 2026-05-27 had 8 of 16 prompts collide on industrial-workshop vocabulary because the blacklist was global-recent rather than within-batch.
reasoning_tokens reporting (SGLang/Mistral fallback path only). Always reports reasoning_tokens: 0 even when reasoning is active. The reasoning_content field in the response is populated correctly. Known SGLang reporting bug, not a model issue. Primary production stack is vLLM/Qwen, unaffected.
Web-search-grounded TLA translation in podcast scripts. The Hexabella role in the podcast pipeline needs realtime web-search lookups for three-letter acronyms with definitions plus concrete examples. Architecture decision pending: pre-generation search-pass (deterministic, more latency) versus inline tool-calls during local-model generation (flexible, less reproducible). Both approaches need a dedicated plan doc before implementation.
Image-prompt collision on multi-article batches. When generating hero images for many articles in one pipeline run, the motif blacklist tracks only globally-recent motifs, not what was already chosen earlier in the same batch. Tracked as a per-batch motif-rotation fix.
Voxtral-4B TTS expressivity ceiling. Spot-listen test 0/10 with the current model. Two failure modes (filler hallucination on short clips, flat staccato on long ones) and no sweet spot. V7 retraining plan abandoned. Pivot spike to VibeVoice / Higgs Audio v2 / IndexTTS-2 not yet executed. Podcast pipeline remains on Voxtral until the spike completes.
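The within-batch motif collision described in the hero-image items above comes down to when the blacklist is updated. A minimal sketch of a per-batch rotation, with hypothetical names and the assumption that every article has at least one candidate motif:

```python
from typing import Iterable, List, Sequence, Set


def pick_motifs(
    candidates_per_article: Sequence[Sequence[str]],
    recent: Iterable[str],
) -> List[str]:
    """Pick one motif per article, blocking motifs already used in this batch."""
    blocked: Set[str] = set(recent)  # globally-recent motifs seed the blacklist
    chosen: List[str] = []
    for candidates in candidates_per_article:
        # First candidate not yet blocked; fall back to the first candidate
        # when every option is already taken.
        pick = next((m for m in candidates if m not in blocked), candidates[0])
        chosen.append(pick)
        blocked.add(pick)  # the within-batch update a global-recent list lacks
    return chosen
```

The only difference from a global-recent blacklist is the `blocked.add(pick)` line inside the loop, which is why a large batch generated in one run can collide when it is missing.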