Submitted the production Qwen3.6-35B AutoRound int4-mixed + DFlash k=3 config as a community recipe: 69.2 tok/s single-stream decode (prefill-separated, cross-checked against vLLM's TPOT metric), 18/18 on a deterministic coding gate, vision tower available behind an extra_flags toggle. Merged same day. Anyone with a DGX Spark can now run the exact daily-driver config behind this blog with one sparkrun command.
Read the full diagnosis on the blog →
Issue #215 (filed 2024-05-06) flagged that the README's documented pattern for assigning the provider to window.webln in onConnected leaves a stale provider on the global after disconnect, the saved reference outlives the wallet connection. Library can't own window.webln (would conflict with consumer expectations + other extensions), so the right fix is documentation. PR extends the WebLN global object example to pair onConnected with onDisconnected and delete the global. Merged 2026-05-07.
Read the full diagnosis on the blog →
Listed the Sovereign AI MCP under Knowledge & Memory in the canonical awesome-mcp-servers index. Part of a four-registry distribution push (Smithery, Glama Connector, Glama Server, awesome-mcp) executed in a single afternoon. Merged within hours of submission.
Read the full diagnosis on the blog →
Production field report to the quant authors: their int4-mixed build of Qwen3.6-35B replaced a 4.75-bit compressed-tensors build as this stack's resident model. +12.7% decode under the same ruler, identical 18/18 on the deterministic coding gate, and of the three quants compared it was the only one that kept a working vision tower. Most quant feedback is benchmarks; this is a deployment decision with the receipts.
Read the full diagnosis on the blog →
Shared the deterministic-gate measurement with the maintainers: Serena rescued a weak model on the dangerous ambiguous-rename task (0/3 to 1/3 correct, collateral damage down from 8 files to 1.7) while a strong model never needed it and paid +15-158% input tokens in tool-schema overhead. Net: a semantic guardrail for weak models, a tax on strong ones, and a concrete optimization target (schema size) for the project.
Read the full diagnosis on the blog →
Took the 65-75% token-reduction claim of a ~200k-install skill to five models with deterministic gates. What reproduces: ~-31% output tokens on chat answers. What does not: the headline claim, on any model; the instruction rides as ~1k input tokens per request, so total cost never dropped, and one model's outputs got 18% longer. Suggested an honest re-scoping of the claim rather than its removal.
Read the full diagnosis on the blog →
Docs suggestion from a measurement trap caught in production: DFlash's speculative-decoding acceptance rate needs a real request window to warm up, so a cold TPOT reading right after launch undersells steady-state by ~35% (43 vs 69 tok/s on the same unchanged setup). Exactly the kind of number that fuels wrong 'X is slow' reports. Suggested one README line: benchmark after warmup.
Read the full diagnosis on the blog →
Independent reproduction of the verified gpt-oss-120b single-Spark leaderboard entry: 59.5 tok/s measured against their published 58.82, with a GPU-clock-lock check confirming the number is a memory-bandwidth ceiling rather than a tuning artifact. Plus two documented traps in the nightly-image path (44-minute compile hang that freezes unified-memory boxes, MARLIN MXFP4 ballooning past the memory cap) and a request to publish the prebuilt CUTLASS image.
Read the full diagnosis on the blog →
Added three diagnostic data points to the existing unified-memory ARM OOM report: the MARLIN MXFP4 MoE path reaching ~118 GB despite a 0.7 gpu-memory-utilization cap, the FlashInfer MXFP4 kernel rejecting SM121A outright, and a source-built CUTLASS + FlashInfer image (arch 12.1a) that respects the cap at ~61 GB and serves at 59.5 tok/s. Narrows the bug to memory accounting in the MARLIN fallback rather than weights or hardware.
Read the full diagnosis on the blog →
Re-submission of vite-plugin-pwa #930 after maintainer @userquin closed the original (wrong-repo: docs moved to this dedicated repository). The same 'Will coming soon' typo found in deployment/aws.md — 'will' takes a bare infinitive, not a present participle. 1-line fix to 'Coming soon'. Two-step engagement pattern: (1) initial PR in wrong repo, (2) maintainer redirects, (3) re-submit cleanly + offer follow-up cleanup of orphan placeholder files in the legacy repo.
AgentController._handle_message_action dispatches a RecallAction with EventSource.USER after every user message. The recall lands as a synthetic USER turn, so the LLM request carries two consecutive user messages and any Mistral-template-strict backend rejects it with HTTP 400 on prompt 1 of every session. Three suggested fixes plus the controller-patch workaround that operators run today.
Read the full diagnosis on the blog →
Under context pressure the model emits a partial-rewrite that drops chunks of the original file, but the tool-call success shape is unchanged so the write commits silently. Suggested handling: pre-commit size-delta check, an apply_patch primitive bounded to the section being edited, and a context-overflow flag that an MCP can refuse on.
Read the full diagnosis on the blog →
OpenAIAdapter._parse_message rejected streaming chunks 2+ because delta.role is null per OpenAI spec but LLMMessage.role is a required enum without default. TUI failed on the first user prompt of every session. Three-line defensive default, plus a suggestion for a cleaner upstream form (Optional role with default).
Read the full diagnosis on the blog →
Reproduction with raw SGLang chunk dump, two falsified hypotheses (a 2.9.3 streaming-parser regression and a Pydantic 2.13 enum-strictness change) before the actual root cause was traced to _parse_message.
Read the full diagnosis on the blog →
Cross-pollination comment from adjacent inference engines on the same DGX Spark / GB10 hardware: SGLang nightly with Mistral Small 4 NVFP4 + vLLM for Voxtral. Pointed maintainers to sibling sgl-project/sglang#21085 and the two open SGLang PRs attempting fixes (#21558, #24269). Reinforced @EmilHaase's finding that native causal-conv1d works on SM121 but Triton-generated equivalents fail — the bug is in the Triton-codegen-for-SM121 path specifically, not the substrate. Offered to test the CUDA_LAUNCH_BLOCKING=1 + --enforce-eager workaround on this Spark and cross-check whether vLLM's Triton SSM ops fail the same way SGLang's flashinfer-GDN path does. Asked for a minimal Mamba-2 repro that avoids the 120B Nemotron-3 weights.
`hf download` exits with status 0 while leaving `.incomplete` blobs in the cache when xet 416 CAS errors occur on flaky upstream connectivity. Distinct from #3960 (process killed): in our case the exit code lies, only filesystem state is the ground truth. Repro covers DGX Spark (GB10 Blackwell SM12.1, ARM v9.2-A) on three large models (Voxtral-4B-TTS-2603, VibeVoice-Large, Qwen3.6-35B-A3B-PrismaQuant). Three independently-effective workarounds documented (HF_HUB_DISABLE_XET=1, HF_HUB_DOWNLOAD_TIMEOUT=300, filesystem .incomplete check), plus the 347-LOC hf-pull wrapper combining all three with exponential-backoff retries. Resolved and closed 2026-06-11: maintainer @Wauplin reproduced the exit-0-on-truncation deterministically and confirmed the exit-code-vs-filesystem-state mismatch was the real bug. Root cause in hf_xet 1.4.2: a spurious within-file HTTP 416 was treated as end-of-file, and the partial byte count was returned as success without comparing it to the expected size. Fixed in hf_xet >= 1.5.1 via xet-core #716 (bounds the reconstruction range to the known file size) and #735 (adds a downloaded-vs-expected size check that turns a short read into a hard error instead of a silent exit 0); huggingface_hub #4306 further reworks the cache write path so partial files can never be committed in place.
Read the full diagnosis on the blog →
Issue #3456 (open since 2025-11-23, no maintainer reply) reports that GPUStack's HuggingFace model-download progress shows wrong percentages until container restart. Reading model_file_manager.py revealed a cluster of seven open issues (#3547, #3833, #3853, #3882, #5253, #2734) all touching the same monkey-patched tqdm logic. Three compounding bugs identified: thread-unsafe _handle_tqdm_init while HfDownloader runs thread_map(max_workers=8), no tqdm.close hook so _file_line_mapping accumulates stale entries (which is exactly what container-restart resets), and a separate two-tqdm-instances-per-file pattern visible in the reporter's logs. PR scope: bugs A and C only (lock the init mutations + add a close hook). Iterated once after gemini-code-assist flagged that _assign_file_basename was still outside the lock; moved it inside in the second commit. Two-tqdm-per-file behavior left for a follow-up PR with a local reproducer (HF chunked-download interrupt mid-transfer). Engagement model: identified the cluster first to give the maintainer a one-PR-fixes-seven-issues hook rather than a single bug report.
Closed by maintainer @userquin: the framework docs pages in this repository are orphan placeholders, the canonical docs moved to vite-pwa/docs. PR re-submitted there as #192. While engaging on the close, I asked the maintainer whether the orphan files (astro.md + iles.md, both <12 lines) should be removed entirely from this repo to prevent search-engine landings on stale content. Engagement model worked even on a closed PR: same-day maintainer response, helpful redirect, opening for follow-up cleanup PR.
Bitcoin Connect listens to the webln:enabled event in boot.ts to detect external WebLN providers, but never dispatches it itself when one of its own connectors finishes connecting. PR dispatched the event after a successful connect in store.ts and tightened the listener guard in boot.ts. Closed by maintainer rolznz on 2026-05-06 with a design clarification: BC deliberately does not own window.webln, consumers manage the provider locally via onConnected, so firing the global event would mislead spec-following callers that read window.webln directly. Useful outcome anyway — the closure surfaced the documented ownership model and led to companion docs PR #385.
Read the full diagnosis on the blog →
OpenClaw's request shape produces consecutive user messages in normal multi-turn flows, which Mistral's strict-alternation chat template rejects with HTTP 400. Filed as a feature request with two suggested in-repo approaches (provider capability flag, or pre-send hook on the Mistral plugin). Closed not-planned on 2026-06-01: migrating my own stack to Qwen removed the strict-alternation constraint, so I am no longer blocked and won't be pursuing the PR. The side-car proxy remains a documented workaround for anyone still serving Mistral via SGLang.
Read the full diagnosis on the blog →
No contributions match the current filter.
Why this page exists
Every fix that ships in production here started as a bug somewhere upstream. Sending the fix back is the honest move and produces a more durable record than a private patch. This page is also the consolidated public-facing index of contributions, easier than scrolling GitHub profile activity.
If a contribution is open and you have an interest in it landing, react on the GitHub issue or PR directly. The blog posts give the engineering context, the upstream tracker is where the merge decision happens.