The huggingface-hub CLI exits zero while leaving five out of six safetensor shards as .incomplete files. Three failure modes from the same overnight run, and the wrapper that catches all of them.

Why hf download Lies to You at 22 GB on DGX Spark

I tried to pull rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm onto the DGX Spark overnight: 22 GB across six safetensor shards. hf download exited zero in 20 seconds and the run moved on to the next model. On disk: 1.9 GB present, five .incomplete files behind it. The CLI lied about its own success.

This article is the postmortem on the three distinct failure modes that bit me on the same overnight run, and the hf-pull wrapper that makes the next run actually robust. The original setup article I wrote two weeks ago mentioned the first fix (HF_HUB_DISABLE_XET=1) but not the other two, because I had not hit them yet at the scale that exposes them.

Why this matters in context. The Qwen3.6 weights are the LLM half of the model-stack migration described in Spark Arena Rank 4 Made Me Add Qwen3.6 to My DGX Spark. The TTS-candidate weights (VibeVoice / Higgs Audio v2 / IndexTTS-2) are the spike pool described in Voxtral Capped at 3/10: Picking the Next Open TTS. Both strategy articles now point at hf-pull as the standard model-download path going forward.

Quick Take

  • Failure 1: Xet protocol defaults to IPv6, unreachable on DGX Spark. Fix: HF_HUB_DISABLE_XET=1. Documented in the SGLang setup article.
  • Failure 2: The default httpx read-timeout (around 15 seconds of idle) is too short for 3.5 GB shards riding out HF CDN edge slowdowns. Fix: HF_HUB_DOWNLOAD_TIMEOUT=300. Not documented anywhere I could find.
  • Failure 3: hf download returns exit zero even when five of six shards are left as .incomplete. Fix: validate by filesystem, not by exit code. The CLI exit code is advisory at best.
  • Wrapper at /data/scripts/ops/hf-pull combines all three fixes plus exponential backoff, lock cleanup, progress lines that work in log files, and matrix-butler notification. Source in cipherfox/sovereign-ops.

What broke, in order

The first orchestrator run ended at 00:44 last night with this summary:

✅ z-lab/Qwen3.6-35B-A3B-DFlash               OK in  20s
✅ microsoft/VibeVoice-Realtime-0.5B          OK in  20s
✅ vibevoice/VibeVoice-1.5B                   OK in 170s
✅ bosonai/higgs-audio-v2-generation-3B-base  OK in  20s
❌ rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit FAILED
❌ aoi-ot/VibeVoice-Large                     FAILED
❌ IndexTeam/IndexTTS-2                       FAILED

Four of seven looked fine. Three failed. The disk told a different story:

$ du -sh /ai/models/models--*/
905M   models--z-lab--Qwen3.6-35B-A3B-DFlash         (claimed OK — really done)
1.9G   models--microsoft--VibeVoice-Realtime-0.5B    (claimed OK — really done)
5.1G   models--vibevoice--VibeVoice-1.5B             (claimed OK — really done)
160M   models--bosonai--higgs-audio-v2-...           (claimed OK — actually JSON only)
1.9G   models--rdtand--Qwen3.6-35B-A3B-PrismaQuant   (claimed FAILED — really 8% done)
9.2G   models--aoi-ot--VibeVoice-Large               (claimed FAILED — really 65% done)
0      models--IndexTeam--IndexTTS-2                 (claimed FAILED — really 0%)

The Higgs Audio entry is the smoking gun. hf download reported OK in 20s and the orchestrator believed it. The actual on-disk size was 160 MB out of an expected 21.5 GB. The CLI had downloaded only the JSON manifests and the model.safetensors.index.json, then returned exit zero.

Two of the three “FAILED” cases were actually partial successes. VibeVoice-Large at 9.2 GB out of 14 GB is 65% done. The exit-code signal is unreliable in both directions.

Failure 1: Xet protocol over IPv6

When huggingface_hub decides a file should come through Xet content-addressed storage (their newer CDN protocol), it tries to reach cas-server.xethub.hf.co. On DGX Spark with the default network stack, IPv6 to that hostname does not work, and the fallback path inside xet-core 1.4.2 hits an IPv6-flavored URL anyway. The error you see is:

Fatal Error: "cas::get_reconstruction" api call failed (request id ...):
HTTP status client error (416 Range Not Satisfiable)

The 416 is a red herring. It looks like a partial-content negotiation problem, but it is actually an IPv6 path that returned an empty response, which the Rust client then misinterprets as a 416. The problem does not occur on x86 hosts with dual-stack networking, which is why the upstream maintainers have not prioritized a fix.
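You can confirm the diagnosis without touching huggingface_hub at all. A minimal probe, in Python, against the cas-server hostname from the error above; this is just a connectivity check, not wrapper code:

import socket

HOST, PORT = "cas-server.xethub.hf.co", 443

# Try a TCP connect over each address family and report which one works.
for family, label in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
    try:
        addr = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)[0][4]
        with socket.socket(family, socket.SOCK_STREAM) as s:
            s.settimeout(5)
            s.connect(addr)
        print(f"{label}: reachable")
    except OSError as exc:
        print(f"{label}: failed ({exc})")

If the diagnosis above holds, IPv4 connects and IPv6 fails on the Spark.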

The two-line fix from my SGLang setup article:

export HF_HUB_DISABLE_XET=1
wget -4 <url>      # for any direct wget calls

HF_HUB_DISABLE_XET=1 forces huggingface_hub to fall back to the standard HTTP LFS protocol. Downloads go through huggingface.co/<repo>/resolve/<ref>/<file> instead of through xet CAS reconstruction. Plain HTTP LFS works fine on DGX Spark.
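If you want to see the exact URL the fallback uses, huggingface_hub can build it for you. A quick sanity check, assuming hf_hub_url is exported by your installed version:

from huggingface_hub import hf_hub_url

# The plain HTTP LFS URL that HF_HUB_DISABLE_XET=1 falls back to.
print(hf_hub_url(
    repo_id="rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm",
    filename="model-00001-of-00006.safetensors",
))
# https://huggingface.co/rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm/resolve/main/model-00001-of-00006.safetensors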

There is one additional gotcha I had not seen in the original setup. The xet directory under HF_HOME gets created with root:root ownership if a sudo-launched process touches it first. The xet-runtime then cannot write logs there, fails with Permission denied (os error 13), and the failure cascades into the same 416 pattern even when HF_HUB_DISABLE_XET=1 is set, because parts of the xet-init still run regardless. Fix:

pkexec chown -R cipherfox:cipherfox /ai/models/xet

After both fixes, this failure mode is fully resolved.

Failure 2: httpx read-timeout vs HF CDN edge inconsistency

The DGX Spark line at this address is 319 Mbps down, 13 ms ping. A 22 GB Qwen3.6 model should download in roughly 10 minutes at line speed. The actual experience was several hours of The read operation timed out errors:

Error while downloading from
  https://huggingface.co/rdtand/...-PrismaQuant-.../model-00003-of-00006.safetensors:
  The read operation timed out
Trying to resume download...
Error while downloading from
  https://huggingface.co/rdtand/...-PrismaQuant-.../model-00001-of-00006.safetensors:
  The read operation timed out
Trying to resume download...

This is not a network problem on your side. It is HF CDN edge nodes serving 3.5 GB shards with intermittent throughput drops. One second the chunk arrives at 40 MB/s. The next second the same TCP connection delivers nothing for 16 seconds. The default httpx read-timeout (around 15 seconds for idle data) triggers, the connection is reset, huggingface_hub retries from the partial download, and the same shard hits the same edge node a few minutes later.

The fix is to give the timeout enough headroom that a 30-second CDN hiccup does not kill the connection:

export HF_HUB_DOWNLOAD_TIMEOUT=300

Five minutes per idle period is generous. It does not slow successful downloads at all. It only changes the give-up threshold. Without this, big-shard models on the DGX Spark are basically a coinflip per attempt. With this, the same 22 GB download completed in under 15 minutes on the second attempt.
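The same pair of fixes applies to Python-side pulls. A minimal sketch, assuming the environment variables are set before huggingface_hub is imported (it reads them at import time):

import os

# Failure 1 and Failure 2 fixes; must precede the huggingface_hub import.
os.environ["HF_HUB_DISABLE_XET"] = "1"
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "300"

from huggingface_hub import snapshot_download

snapshot_download("rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm")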

Failure 3: hf download exits zero with .incomplete files behind

This is the worst of the three because it silently corrupts your orchestration logic. After enough retries, huggingface_hub.snapshot_download returns successfully even when individual shard downloads have given up. The snapshot directory ends up with symlinks for the small files (manifests, tokenizer, config) and either missing symlinks or symlinks pointing to .incomplete blobs for the large shards.

The hf CLI propagates exit zero. Any orchestrator that trusts the exit code marks the model as DONE and never tries again.

You can reproduce this in 30 seconds:

# Run hf download, kill it mid-stream, wait, run again. Exit code is 0.
$ hf download rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm
^C
$ hf download rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm
$ echo $?
0
$ find /ai/models/models--rdtand--Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm/blobs/ -name "*.incomplete" | wc -l
5

Exit zero. Five incomplete files on disk. The orchestrator believes the model is done.

The fix is to never trust the exit code. Authoritative completeness is filesystem state:

from pathlib import Path

def is_complete(cache_dir: Path) -> bool:
    # Done means the blobs directory exists and holds zero *.incomplete
    # partials. The hf exit code plays no part in this decision.
    blobs = cache_dir / "blobs"
    if not blobs.is_dir():
        return False
    return not any(blobs.glob("*.incomplete"))

Cross-check against the manifest if you want a second safety net: download size should be within one percent of sum(siblings.size) from HfApi().model_info(repo_id, files_metadata=True). The wrapper does this on validate-only runs.
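A sketch of that second net, with matches_manifest as a hypothetical helper name; the HfApi call and the siblings size metadata are the real library API:

from pathlib import Path

from huggingface_hub import HfApi

def matches_manifest(repo_id: str, cache_dir: Path, tolerance: float = 0.01) -> bool:
    # Expected byte count, summed from the hub manifest.
    info = HfApi().model_info(repo_id, files_metadata=True)
    expected = sum(s.size or 0 for s in info.siblings)
    # Actual byte count on disk, partials included.
    blobs = cache_dir / "blobs"
    actual = sum(f.stat().st_size for f in blobs.iterdir() if f.is_file())
    return abs(actual - expected) <= tolerance * expected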

The hf-pull wrapper

The three fixes live together at /data/scripts/ops/hf-pull in the cipherfox/sovereign-ops repo. The shape:

hf-pull <repo-id> [<repo-id> ...]
hf-pull --models-from list.txt
hf-pull --validate-only <repo-id>
hf-pull --max-attempts 25 --timeout 600 <repo-id>
hf-pull --notify <repo-id>

Key design choices:

Subprocess-based, not library-based. I wrap hf download as a subprocess instead of calling snapshot_download directly. If hf internals raise an uncaught exception, the parent script catches the non-zero exit and retries cleanly. Library-level integration means a single bad import or a KeyError deep in hub code crashes the orchestrator. The retry loop is sketched below, after the backoff schedule.

Exponential backoff with cap. 10, 30, 60, 120, 240, 300, 300, 300 seconds. Most transient errors clear within 60 seconds. The cap at 300 seconds prevents pathological waits during HF maintenance windows.
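Condensed, the attempt loop looks like this. A Python sketch for readability (cache_dir_for is a hypothetical path helper; is_complete is the check from above):

import subprocess
import time

BACKOFF = [10, 30, 60, 120, 240, 300]  # seconds; capped at 300 thereafter

def pull(repo_id: str, max_attempts: int = 8) -> bool:
    cache_dir = cache_dir_for(repo_id)  # hypothetical: resolves the HF_HOME cache path
    for attempt in range(max_attempts):
        # Exit code deliberately ignored; the filesystem decides (Failure 3).
        subprocess.run(["hf", "download", repo_id], check=False)
        if is_complete(cache_dir):
            return True
        # Lock cleanup between attempts is sketched below.
        time.sleep(BACKOFF[min(attempt, len(BACKOFF) - 1)])
    return False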

Lock cleanup between attempts. huggingface_hub writes .lock files into HF_HOME/.locks/<repo>/. If a previous attempt died ungracefully, the lock files block the next attempt until they age out. The wrapper deletes them before each retry.
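What that cleanup amounts to, assuming the default models--<org>--<name> folder naming under HF_HOME/.locks:

from pathlib import Path

def clear_stale_locks(hf_home: Path, repo_id: str) -> None:
    # Leftover .lock files from a crashed attempt block the next one.
    lock_dir = hf_home / ".locks" / ("models--" + repo_id.replace("/", "--"))
    if not lock_dir.is_dir():
        return
    for lock in lock_dir.glob("*.lock"):
        lock.unlink(missing_ok=True)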

Progress lines that work in log files. Standard tqdm progress bars are noise in nohup logs. The wrapper emits one line every 30 seconds with the format:

  [<repo>] 47.3%  10.41 GB / 22.02 GB  18.4 MB/s

Computed from disk-byte-count vs manifest-size, so it works whether the underlying hf download is showing its own progress bars or not.
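The computation itself is small. A sketch, with expected_bytes coming from the same manifest sum used in matches_manifest above:

from pathlib import Path

def disk_bytes(blobs: Path) -> int:
    # Partials (*.incomplete) count too, so the line moves during a download.
    return sum(f.stat().st_size for f in blobs.iterdir() if f.is_file())

def progress_line(repo: str, blobs: Path, expected_bytes: int,
                  prev_bytes: int, elapsed: float) -> str:
    done = disk_bytes(blobs)
    speed = (done - prev_bytes) / elapsed / 2**20  # MB/s since the last sample
    return (f"[{repo}] {100 * done / expected_bytes:.1f}%  "
            f"{done / 2**30:.2f} GB / {expected_bytes / 2**30:.2f} GB  "
            f"{speed:.1f} MB/s")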

Filesystem-level validation. After every attempt, the wrapper runs the is_complete check above. Success only when zero .incomplete files remain. The hf-CLI exit code is logged as advisory but never decides.

Idempotent. Resume-friendly. No state file required. If you kill it mid-run and start it again, it picks up exactly where it left off, because the cache directory is the state.

What this does not solve

This wrapper does not improve the actual download speed. HF CDN edge inconsistency is upstream. The wrapper just makes the local code resilient to it.

It does not detect content corruption. If HF serves a partial response that fakes the right size, the resulting file is corrupt and the wrapper will not catch it. The right check is re-hashing each blob against its expected sha256 from the LFS metadata, but that adds significant disk I/O. Acceptable trade for now.
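If that trade ever flips, the check is straightforward, because the hub cache names each LFS blob after its sha256 (assumption: LFS-tracked shards only; small non-LFS files use a different etag scheme):

import hashlib
from pathlib import Path

def blob_hash_ok(blob: Path) -> bool:
    # The blob filename doubles as the expected digest for LFS files.
    digest = hashlib.sha256()
    with blob.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == blob.name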

It does not work around HF rate-limits. Unauthenticated requests are limited; for repeated large pulls, set HF_TOKEN to a free HF account token to lift the throttle.

It does not download models that require accepting a license click-through on the HF website. Those need an HF_TOKEN tied to an account that has accepted the terms.

Updated reading-path

The original SGLang setup article mentioned the Xet fix. As of May 13, 2026, the recommended pattern for any DGX Spark model pull is:

hf-pull <repo-id>

instead of the bare hf download from the original article. The wrapper takes care of HF_HUB_DISABLE_XET=1, HF_HUB_DOWNLOAD_TIMEOUT=300, retries, and validation. For one-off pulls of small files where you accept the risk, the bare command is still fine.

Source: cipherfox/sovereign-ops/ops/hf-pull on Gitea. Mirrored to GitHub on the next deploy.

What I Actually Use Now

  • hf-pull <repo> for every model pull on the DGX Spark, no exceptions
  • HF_HOME=/ai/models shared cache so the wrapper and direct hf download see the same state
  • --validate-only after every overnight run to confirm filesystem completeness before launching the actual workload
  • matrix-butler notifications via --notify for unattended runs, so I know which models actually finished before I start the inference container