Three Quants of One 35B Qwen on a DGX Spark. The Fastest Build Was the Only One That Could Still See.
I run one model as my daily coding driver on a DGX Spark: a Qwen3.6-35B. But “a Qwen3.6-35B” is not one file. It ships in several quantizations, each a different way of squeezing 35 billion parameters down to something that serves fast on a 128GB box. The weights are nominally the same network. The quant decides how much of it survives the squeeze, and where.
So I did the boring, necessary thing. I took three quants of the exact same model, served each one through its production launcher on the same box, and measured them on the three things I actually care about: how fast it decodes, whether it gets coding tasks right, and whether it can read an image. One ruler per axis. Every failed measurement thrown out rather than published.
The winner was clean, and the interesting part was not the speed. It was that the smallest, fastest build was the only one of the three that could still see.
What a quant actually is, and what it costs
A model is a giant pile of numbers. Trained at full precision those numbers are 16 bits each, and a 35B model at 16 bits does not leave much room on a shared-memory box for anything else. Quantization rewrites those numbers at lower precision: 8 bits, 4 bits, sometimes less. Smaller weights mean less memory and, because every token has to stream those weights through the chip, faster decode. The cost is accuracy. Round too hard and the model gets measurably dumber.
The three contenders take three different routes:
- AutoRound int4, Intel’s quantization method. It tunes the rounding itself with a short optimization pass instead of rounding blindly, which is why it tends to stay close to the full-precision model at four bits where naive methods lose ground.
- PrismaQuant 4.75-bit, a slightly higher-precision community build in the compressed-tensors format, a safetensors extension that can store different precisions for different layers in one file. That mixing is exactly how a build lands at an odd average like 4.75 bits instead of a flat 4 or 8: the layers most sensitive to rounding get more bits, the rest get fewer. On paper that extra three-quarters of a bit should buy a little more fidelity than a flat int4.
- FP8, eight-bit floating point. Twice the bits of int4, the most conservative squeeze of the three, and the one you would expect to be safest.
That is the hypothesis worth testing: more bits should mean more faithful. Hold it loosely.
The measurement, and why the ruler matters
This is the part most quant comparisons get wrong, and it is the part that can quietly poison a whole blog. There are several tools that will hand you a decode-rate number, and they do not agree with each other. A throughput benchmark I tried gave wildly different figures for the same unchanged model on repeat runs, one of three measurement traps I had to catch this same day. So the rule here is simple and absolute: pick one ruler per axis, measure every contender with that same ruler, and only ever compare same-ruler numbers. A speed from tool A and a speed from tool B are not a comparison, they are a trap.
For speed I used a prefill-separated decode measurement, the same method that decided which quant goes to production in the first place, and cross-checked the winner against vLLM’s own internal token-timing metric. The two independent rulers agreed on the winner to within a fraction of a token per second, which is how you know the number is real and not an artifact.
For accuracy I used a small deterministic-gate benchmark: the model gets a coding task, a script decides pass or fail, and no model ever grades another model. The tasks are real refactors and structured edits from my own work. The harness is the agent-bench project.
For vision I did the only test that counts: I handed each build a screenshot and asked it to read it.
Speed: the lean build wins, and it is not close
Measured with the same prefill-separated ruler, on the same box, through each model’s production launcher:
| quant | decode | versus winner |
|---|---|---|
| AutoRound int4 | 69.2 tok/s | baseline |
| PrismaQuant 4.75-bit | 61.4 tok/s | 13% slower |
| FP8 | (see below) | not viable |
AutoRound decodes about 13% faster than PrismaQuant. Part of that is the smaller weights, fewer bits to stream per token. But part of it is a real production difference: the AutoRound build runs cleanly with speculative decoding, a trick where a small draft model proposes several tokens at once and the big model checks them in a batch. On the PrismaQuant build that same speedup path regressed, so in practice it runs without it. I am not hiding that behind the headline number, I am telling you it is part of the headline number. The production launcher is the honest unit of comparison, because the production launcher is what actually serves you tokens.
One trap I walked into and want to flag, because it is exactly the kind of thing that produces a wrong number: a cold, tiny measurement of the AutoRound build right after launch read 43 tok/s, far below the real rate. That was not the model, it was the speculative-decoding path not yet warmed up over enough requests. Warm it properly, measure over a real window, and it settles at 69. A benchmark that hands you a number is not the same as a trustworthy number. The cold 43 went in the bin.
Accuracy: a tie at the top, and one collapse
On the coding gate, the two int4-class builds are indistinguishable:
| quant | agent-bench pass rate |
|---|---|
| AutoRound int4 | 100% (9/9) |
| PrismaQuant 4.75-bit | 100% (9/9) |
| FP8 | 0% (0/7) |
AutoRound and PrismaQuant both pass everything. The extra three-quarters of a bit in PrismaQuant buys nothing measurable here, which is itself a useful result: at this size, on these tasks, Intel’s tuned int4 is already at the quality ceiling. You do not pay for the bigger build with accuracy, so the only thing the bigger build costs you is speed.
And FP8, the conservative eight-bit build everyone expects to be safest, scored zero. Not “slightly worse.” Zero. On this box, on the image I could run it on, it degenerated into reasoning loops that never produced a usable edit. The hypothesis that more bits means more faithful did not survive contact with the gate. The most likely explanation is an immature kernel path for FP8 on this two-month-old chip rather than the weights themselves, but the honest reporting is what the box did, and the box could not use it. FP8 is parked until that path matures.
The kicker: only one of them kept its eyes
Here is the result I did not expect and would not have found if I had only measured speed and accuracy.
A Qwen3.6 ships with a vision tower, a separate image encoder bolted onto the language model that turns pixels into tokens the model can read alongside text. It is its own block of weights, independent of the language layers, which is the catch: a quant pipeline aimed only at the language weights will either carry the vision tower along untouched or, if the author wants a smaller text-only artifact, drop it on the floor entirely. Nothing about the language quant tells you which choice was made. So I checked, the only way that counts, by handing each build a screenshot of the dashboard from this very stack and asking what it saw.
- AutoRound int4 read it correctly. Header text, status indicators, the lot, and it returned the answer as a proper structured response, not a guess. The lean four-bit build kept its eyes.
- PrismaQuant 4.75-bit is text-only. The vision tower was dropped in the build. It is blind, and no flag brings it back.
- FP8 carried the vision weights but, being unusable on the gate, the point is moot.
If you want to check a build before committing to a multi-gigabyte download, the file list usually tells you: a quant that kept its eyes ships a visibly separate block of vision weights in its tensor index, and a text-only build does not. The model card rarely says it outright. And once the thing is running, the check costs one request: send a screenshot, ask what it sees. I trust the second method more, because it measures the build you are actually serving, not the listing of the build someone says they uploaded.
This is the part that turns a 13% speed gap into a real decision. AutoRound is faster, ties on accuracy, and is the only one of the three that can look at a screenshot. PrismaQuant asks you to give up both speed and sight for three-quarters of a bit that bought no accuracy. There is a longer version of the vision story, including a wrong call I had to walk back about whether vision and tool-use can coexist, in the recipe teardown.
A note on the 119B in the room
For scale I also ran a much larger model, a 119B Mistral, through the same accuracy gate. It scored 88% and it keeps its vision, so it is a genuinely capable model. But it is a different network, not a quant of the same Qwen, and I could not get a clean single-stream decode number out of its server in this round, so I will not quote a speed I did not trust enough to publish. A bigger model with vision sounds like an upgrade until you notice the 35B already passes everything the gate throws at it, already sees, and decodes faster. Size was not the missing ingredient.
What I run, and why
AutoRound int4. Fastest of the three, tied at the top on accuracy, and the only one that can read an image. The decision was not a coin flip between close options, it was three independent measurements all pointing at the same build.
The broader lesson outlasts these specific files. The quant you pick is not a cosmetic choice between equivalent downloads, it is a capability choice: speed, accuracy, and whole modalities can live or die in the squeeze, and the only way to know which is to measure each axis with its own honest ruler and throw out the runs that lie. I deleted the PrismaQuant and FP8 weights after this, because once the numbers are captured and the winner is clear, keeping the losers around is just disk you are paying for. The measurement is the artifact worth keeping, not the also-rans.
This is the same box, and the same measure-it-yourself discipline, behind the gpt-oss-120b teardown and the Nemotron-120B run. Different models, same rule: the download counter is a measure of curiosity, not of what will actually serve you well on your own hardware.