Upstream

Bugs found while running this stack get filed and patched at the source. Tools extracted from this stack get released under permissive licenses. Both halves below (releases first, then the contributions to other projects), with links to code, upstream issues/PRs, and the blog posts that document the diagnosis. Identity for all of these: github.com/cipherfoxie.

5released open-source
24contributions
3merged
14open
19projects

Snapshot from GitHub API on 2026-06-20. Refreshed each deploy.

Released open-source

The other half of upstream: tools and servers extracted from this stack and released under permissive licenses. 5 so far.

Contributions to other projects

PR merged spark-arena/community-recipe-registry #6

Add qwen3.6-35b-a3b AutoRound int4-mixed + DFlash recipe (69.2 tok/s measured, production config)

Submitted the production Qwen3.6-35B AutoRound int4-mixed + DFlash k=3 config as a community recipe: 69.2 tok/s single-stream decode (prefill-separated, cross-checked against vLLM's TPOT metric), 18/18 on a deterministic coding gate, vision tower available behind an extra_flags toggle. Merged same day. Anyone with a DGX Spark can now run the exact daily-driver config behind this blog with one sparkrun command.

Read the full diagnosis on the blog →

PR merged getAlby/bitcoin-connect #385

docs: clear window.webln on disconnect (closes #215)

Issue #215 (filed 2024-05-06) flagged that the README's documented pattern for assigning the provider to window.webln in onConnected leaves a stale provider on the global after disconnect, the saved reference outlives the wallet connection. Library can't own window.webln (would conflict with consumer expectations + other extensions), so the right fix is documentation. PR extends the WebLN global object example to pair onConnected with onDisconnected and delete the global. Merged 2026-05-07.

Read the full diagnosis on the blog →

Issue open intel/auto-round #1919

Production report: Qwen3.6-35B-A3B int4-mixed replaced a 4.75-bit build on DGX Spark (GB10), +12.7% decode, no quality loss, vision intact

Production field report to the quant authors: their int4-mixed build of Qwen3.6-35B replaced a 4.75-bit compressed-tensors build as this stack's resident model. +12.7% decode under the same ruler, identical 18/18 on the deterministic coding gate, and of the three quants compared it was the only one that kept a working vision tower. Most quant feedback is benchmarks; this is a deployment decision with the receipts.

Read the full diagnosis on the blog →

Discussion open oraios/serena #1573

Measured Serena on two self-hosted models with deterministic gates: no effect on a strong model, but it rescued a weak one on the dangerous task

Shared the deterministic-gate measurement with the maintainers: Serena rescued a weak model on the dangerous ambiguous-rename task (0/3 to 1/3 correct, collateral damage down from 8 files to 1.7) while a strong model never needed it and paid +15-158% input tokens in tool-schema overhead. Net: a semantic guardrail for weak models, a tax on strong ones, and a concrete optimization target (schema size) for the project.

Read the full diagnosis on the blog →

Discussion open JuliusBrussee/caveman #520

Independent measurement across 5 models: output reduction is real (~-31%) but total cost never dropped; the 65-75% claim does not reproduce

Took the 65-75% token-reduction claim of a ~200k-install skill to five models with deterministic gates. What reproduces: ~-31% output tokens on chat answers. What does not: the headline claim, on any model; the instruction rides as ~1k input tokens per request, so total cost never dropped, and one model's outputs got 18% longer. Suggested an honest re-scoping of the claim rather than its removal.

Read the full diagnosis on the blog →

Issue open z-lab/dflash #135

Docs suggestion: cold-start TPOT undersells steady-state by ~35% (43 vs 69 tok/s) until the draft path warms

Docs suggestion from a measurement trap caught in production: DFlash's speculative-decoding acceptance rate needs a real request window to warm up, so a cold TPOT reading right after launch undersells steady-state by ~35% (43 vs 69 tok/s on the same unchanged setup). Exactly the kind of number that fuels wrong 'X is slow' reports. Suggested one README line: benchmark after warmup.

Read the full diagnosis on the blog →

Issue open spark-arena/recipe-registry #17

Independent reproduction: gpt-oss-120b single-Spark 58.82 tok/s confirmed (59.5 measured), plus two notes on the nightly image

Independent reproduction of the verified gpt-oss-120b single-Spark leaderboard entry: 59.5 tok/s measured against their published 58.82, with a GPU-clock-lock check confirming the number is a memory-bandwidth ceiling rather than a tuning artifact. Plus two documented traps in the nightly-image path (44-minute compile hang that freezes unified-memory boxes, MARLIN MXFP4 ballooning past the memory cap) and a request to publish the prebuilt CUTLASS image.

Read the full diagnosis on the blog →

Issue open vllm-project/vllm #43969

gpt-oss-120b MXFP4 on GB10: MARLIN allocates past gpu-memory-utilization; CUTLASS source build resolves it (diagnostic comment)

Added three diagnostic data points to the existing unified-memory ARM OOM report: the MARLIN MXFP4 MoE path reaching ~118 GB despite a 0.7 gpu-memory-utilization cap, the FlashInfer MXFP4 kernel rejecting SM121A outright, and a source-built CUTLASS + FlashInfer image (arch 12.1a) that respects the cap at ~61 GB and serves at 59.5 tok/s. Narrows the bug to memory accounting in the MARLIN fallback rather than weights or hardware.

Read the full diagnosis on the blog →

PR open vite-pwa/docs #192

docs(deployment): fix grammar in AWS Amplify WIP placeholder

Re-submission of vite-plugin-pwa #930 after maintainer @userquin closed the original (wrong-repo: docs moved to this dedicated repository). The same 'Will coming soon' typo found in deployment/aws.md — 'will' takes a bare infinitive, not a present participle. 1-line fix to 'Coming soon'. Two-step engagement pattern: (1) initial PR in wrong repo, (2) maintainer redirects, (3) re-submit cleanly + offer follow-up cleanup of orphan placeholder files in the legacy repo.

Issue open OpenHands/OpenHands #14287

[Bug]: RecallAction with EventSource.USER after a user message produces two consecutive USER turns, breaks Mistral / SGLang strict-alternation backends

AgentController._handle_message_action dispatches a RecallAction with EventSource.USER after every user message. The recall lands as a synthetic USER turn, so the LLM request carries two consecutive user messages and any Mistral-template-strict backend rejects it with HTTP 400 on prompt 1 of every session. Three suggested fixes plus the controller-patch workaround that operators run today.

Read the full diagnosis on the blog →

Issue open mistralai/mistral-vibe #667

[Bug]: Context-window overflow during write_file silently drops file content, agent reports 'committed' on a corrupted file

Under context pressure the model emits a partial-rewrite that drops chunks of the original file, but the tool-call success shape is unchanged so the write commits silently. Suggested handling: pre-commit size-delta check, an apply_patch primitive bounded to the section being edited, and a context-overflow flag that an MCP can refuse on.

Read the full diagnosis on the blog →

Issue open vllm-project/vllm #37431

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1

Cross-pollination comment from adjacent inference engines on the same DGX Spark / GB10 hardware: SGLang nightly with Mistral Small 4 NVFP4 + vLLM for Voxtral. Pointed maintainers to sibling sgl-project/sglang#21085 and the two open SGLang PRs attempting fixes (#21558, #24269). Reinforced @EmilHaase's finding that native causal-conv1d works on SM121 but Triton-generated equivalents fail — the bug is in the Triton-codegen-for-SM121 path specifically, not the substrate. Offered to test the CUDA_LAUNCH_BLOCKING=1 + --enforce-eager workaround on this Spark and cross-check whether vLLM's Triton SSM ops fail the same way SGLang's flashinfer-GDN path does. Asked for a minimal Mamba-2 repro that avoids the 120B Nemotron-3 weights.

Issue closed huggingface/huggingface_hub #4223

`hf download` exits 0 while leaving `.incomplete` blobs after xet 416 CAS errors

`hf download` exits with status 0 while leaving `.incomplete` blobs in the cache when xet 416 CAS errors occur on flaky upstream connectivity. Distinct from #3960 (process killed): in our case the exit code lies, only filesystem state is the ground truth. Repro covers DGX Spark (GB10 Blackwell SM12.1, ARM v9.2-A) on three large models (Voxtral-4B-TTS-2603, VibeVoice-Large, Qwen3.6-35B-A3B-PrismaQuant). Three independently-effective workarounds documented (HF_HUB_DISABLE_XET=1, HF_HUB_DOWNLOAD_TIMEOUT=300, filesystem .incomplete check), plus the 347-LOC hf-pull wrapper combining all three with exponential-backoff retries. Resolved and closed 2026-06-11: maintainer @Wauplin reproduced the exit-0-on-truncation deterministically and confirmed the exit-code-vs-filesystem-state mismatch was the real bug. Root cause in hf_xet 1.4.2: a spurious within-file HTTP 416 was treated as end-of-file, and the partial byte count was returned as success without comparing it to the expected size. Fixed in hf_xet >= 1.5.1 via xet-core #716 (bounds the reconstruction range to the known file size) and #735 (adds a downloaded-vs-expected size check that turns a short read into a hard error instead of a silent exit 0); huggingface_hub #4306 further reworks the cache write path so partial files can never be committed in place.

Read the full diagnosis on the blog →

PR closed gpustack/gpustack #5274

fix: thread-safe tqdm progress hijack + cleanup hook on close (#3456)

Issue #3456 (open since 2025-11-23, no maintainer reply) reports that GPUStack's HuggingFace model-download progress shows wrong percentages until container restart. Reading model_file_manager.py revealed a cluster of seven open issues (#3547, #3833, #3853, #3882, #5253, #2734) all touching the same monkey-patched tqdm logic. Three compounding bugs identified: thread-unsafe _handle_tqdm_init while HfDownloader runs thread_map(max_workers=8), no tqdm.close hook so _file_line_mapping accumulates stale entries (which is exactly what container-restart resets), and a separate two-tqdm-instances-per-file pattern visible in the reporter's logs. PR scope: bugs A and C only (lock the init mutations + add a close hook). Iterated once after gemini-code-assist flagged that _assign_file_basename was still outside the lock; moved it inside in the second commit. Two-tqdm-per-file behavior left for a follow-up PR with a local reproducer (HF chunked-download interrupt mid-transfer). Engagement model: identified the cluster first to give the maintainer a one-PR-fixes-seven-issues hook rather than a single bug report.

PR closed vite-pwa/vite-plugin-pwa #930

docs(astro): fix grammar in Astro WIP placeholder (closed: wrong repo)

Closed by maintainer @userquin: the framework docs pages in this repository are orphan placeholders, the canonical docs moved to vite-pwa/docs. PR re-submitted there as #192. While engaging on the close, I asked the maintainer whether the orphan files (astro.md + iles.md, both <12 lines) should be removed entirely from this repo to prevent search-engine landings on stale content. Engagement model worked even on a closed PR: same-day maintainer response, helpful redirect, opening for follow-up cleanup PR.

PR closed getAlby/bitcoin-connect #384

fix: fire webln:enabled event on successful connect (closes #82)

Bitcoin Connect listens to the webln:enabled event in boot.ts to detect external WebLN providers, but never dispatches it itself when one of its own connectors finishes connecting. PR dispatched the event after a successful connect in store.ts and tightened the listener guard in boot.ts. Closed by maintainer rolznz on 2026-05-06 with a design clarification: BC deliberately does not own window.webln, consumers manage the provider locally via onConnected, so firing the global event would mislead spec-following callers that read window.webln directly. Useful outcome anyway — the closure surfaced the documented ownership model and led to companion docs PR #385.

Read the full diagnosis on the blog →

Issue closed openclaw/openclaw #77336

[Feature Request]: Built-in handling of strict role alternation for Mistral / SGLang backends

OpenClaw's request shape produces consecutive user messages in normal multi-turn flows, which Mistral's strict-alternation chat template rejects with HTTP 400. Filed as a feature request with two suggested in-repo approaches (provider capability flag, or pre-send hook on the Mistral plugin). Closed not-planned on 2026-06-01: migrating my own stack to Qwen removed the strict-alternation constraint, so I am no longer blocked and won't be pursuing the PR. The side-car proxy remains a documented workaround for anyone still serving Mistral via SGLang.

Read the full diagnosis on the blog →

Why this page exists

Every fix that ships in production here started as a bug somewhere upstream. Sending the fix back is the honest move and produces a more durable record than a private patch. This page is also the consolidated public-facing index of contributions, easier than scrolling GitHub profile activity.

If a contribution is open and you have an interest in it landing, react on the GitHub issue or PR directly. The blog posts give the engineering context, the upstream tracker is where the merge decision happens.

Open issues (local, not yet upstream)

Issues found while running this stack that have not yet been filed upstream or that are local-design problems with no obvious upstream target. Listed here so the gap is visible, not hidden. The same Engineering Honesty Rule 6 commitment as the manifesto: publish corrections, not just the original claim.