Smaller, Faster, Still Smart? AutoRound int4 vs PrismaQuant for a Self-Hosted Coding Model
My coding agent runs on a self-hosted Qwen3.6-35B-A3B, quantized to 4.75 bits per weight with a build called PrismaQuant and served by vLLM on a DGX Spark. It decodes at about 61 tokens per second, which is fine but never feels fast. So when a 4.0-bit quant of the same model showed up, an Intel AutoRound build, the obvious question was whether trading three quarters of a bit per weight buys real speed, and the obvious fear was whether it makes the model measurably dumber.
That fear is the right instinct and the wrong conclusion, at least here. I measured both halves, and the smaller quant won on speed without losing anything I could detect on quality. This is the duel, with every number reproducible.
Verdict at a glance
| Question | Answer |
|---|---|
| Winner | AutoRound int4-mixed: +12.7% decode speed, zero measurable quality loss |
| Speed | 69.2 tok/s vs 61.4 tok/s for PrismaQuant (same DFlash k=3, same measurement) |
| Quality | 18/18 agent-bench tasks passed, identical to PrismaQuant, including the rename task that breaks weak quants |
| Catch | it is the -mixed variant (MoE gates kept at 16-bit), vision is dropped (irrelevant for a text coding role), N=3 is a signal, not a proof |
| Do I switch? | Recommending yes for the coding role; rollback is a one-line edit because PrismaQuant stays on disk |
Update 2026-06-12: The switch happened; AutoRound is production now. Two facts above have aged. The “vision is dropped” line in the catch row turned out to be wrong: the build carries a full vision tower, which was hidden by a stale --language-model-only launch flag (see the kicker). And PrismaQuant did not stay on disk forever; it was deleted after a third axis joined the comparison. The full three-way teardown, including FP8 and the vision test, is in the quant comparison.
Why fewer bits should mean faster
Single-stream decode on this hardware is bandwidth-bound, not compute-bound. Every token generated requires reading the active weights out of memory, and the GB10’s unified LPDDR5x tops out around 273 GB/s. That ceiling, not the GPU’s math units, is what sets the token rate for a 35B mixture-of-experts model with roughly 3B active parameters.
So the size of the weights you move per token is the lever. PrismaQuant stores weights at 4.75 bits; AutoRound int4 stores them at 4.0. That is about 16% less data crossing the memory bus per token, which on a purely bandwidth-bound decode caps the possible speedup at 4.75/4.0, or roughly 19%. The measured +12.7% lands where the bandwidth math predicts: under that ceiling, because the KV-cache traffic and the fp16 gate layers do not shrink. This is also why the older “100+ tok/s” claims for this model never reproduced here: they would require moving less data than physics allows on a 273 GB/s bus.
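The bandwidth arithmetic can be checked in a few lines. This is a sketch under one simplifying assumption: decode time scales only with the quantized weight bytes read per token, ignoring KV-cache traffic and the 16-bit gate layers. The function names are illustrative, not part of any tooling mentioned here.

```python
# Bits per weight as stated in the article.
PRISMA_BITS = 4.75
AUTOROUND_BITS = 4.0


def data_reduction(old_bits: float, new_bits: float) -> float:
    """Fraction of weight bytes saved per token when moving to fewer bits."""
    return 1.0 - new_bits / old_bits


def speedup_ceiling(old_bits: float, new_bits: float) -> float:
    """Upper bound on the decode speedup if time is proportional to bytes read."""
    return old_bits / new_bits - 1.0


reduction = data_reduction(PRISMA_BITS, AUTOROUND_BITS)
ceiling = speedup_ceiling(PRISMA_BITS, AUTOROUND_BITS)

print(f"less data per token: {reduction:.1%}")
print(f"speedup ceiling:     {ceiling:.1%}")
```

The two figures differ because removing a fraction of the bytes shortens the time per token by that fraction, and tokens per second is the inverse of time per token. A measured gain below the ceiling is what you would expect once the traffic that does not shrink is counted.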
Why fewer bits should mean dumber, and why it did not
Quantization is lossy. Rounding a weight from full precision down to 4 bits throws away information, and in principle that surfaces as worse reasoning, dropped instructions, or subtle wrong answers. Fewer bits, more rounding error, less smart. That is the textbook worry.
Two things blunt it. First, the method matters more than the bit count. AutoRound is a calibrated rounding scheme: it tunes each weight’s rounding direction against real activations rather than rounding naively, so a well-calibrated 4.0-bit model can sit closer to the original than a careless 4.75-bit one. The bit number is a ceiling on quality, not a measurement of it. Second, this is the -mixed build, which means the 243 most sensitive layers, all the mixture-of-experts gate projections, stay at 16 bits. The router that decides which experts fire keeps full precision; only the bulk weight matrices, which tolerate rounding well, drop to 4 bits. Precision is spent where it changes outcomes and saved where it does not.
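To make "tuning the rounding direction against activations" concrete, here is a toy sketch. It is not AutoRound's algorithm: it brute-forces every round-up/round-down choice for eight weights, which only works at toy size. The weights, the calibration activations, and all function names are made up for illustration.

```python
import itertools
import math
import random

QMAX = 7  # symmetric 4-bit grid: integer levels -7..7


def clamp(level: int) -> int:
    """Keep an integer level inside the 4-bit grid."""
    return max(-QMAX, min(QMAX, level))


def round_to_nearest(weights, scale):
    """Naive quantization: every weight snaps to its closest grid point."""
    return [clamp(round(w / scale)) * scale for w in weights]


def output_error(weights, quantized, activations):
    """Squared error of the layer output, summed over the calibration rows."""
    total = 0.0
    for row in activations:
        exact = sum(a * w for a, w in zip(row, weights))
        approx = sum(a * q for a, q in zip(row, quantized))
        total += (exact - approx) ** 2
    return total


def calibrated_rounding(weights, scale, activations):
    """Try every floor/ceil combination and keep the lowest output error."""
    floors = [math.floor(w / scale) for w in weights]
    best, best_err = None, math.inf
    for ups in itertools.product((0, 1), repeat=len(weights)):
        candidate = [clamp(f + up) * scale for f, up in zip(floors, ups)]
        err = output_error(weights, candidate, activations)
        if err < best_err:
            best, best_err = candidate, err
    return best


random.seed(0)  # fixed seed so the toy data is reproducible
weights = [random.uniform(-1, 1) for _ in range(8)]
activations = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
scale = max(abs(w) for w in weights) / QMAX

rtn_err = output_error(weights, round_to_nearest(weights, scale), activations)
cal_err = output_error(
    weights, calibrated_rounding(weights, scale, activations), activations
)
print(f"round-to-nearest output error: {rtn_err:.6f}")
print(f"calibrated output error:       {cal_err:.6f}")
```

Round-to-nearest is one of the combinations the search considers, so the calibrated result can never be worse on the calibration data. That is the core idea: the same bit budget, spent with knowledge of what the layer actually sees.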
That is the theory. The point of this blog is that theory is a hypothesis until it survives measurement.
The quality gate: my own benchmark, pointed at the quant
To test “is it still smart” I used agent-bench, the harness I built to measure whether agent tooling actually helps. Here it does something slightly different: it holds the agent and the tasks fixed and swaps the model underneath, so any change in the pass rate is the quant talking. Because both quants are served under the same model name on the same port, the harness cannot tell them apart, which is exactly what you want from a fair test.
The gate is deterministic, never an LLM grading another LLM. Each task either passes its check (a typecheck, a real rename, or a frozen fact checklist) or it does not. I ran the baseline arm at N=3 on each task, covering both the discriminating refactor and a spread of others:
| Task | What it checks | AutoRound int4-mixed |
|---|---|---|
| ts-ambiguous | rename UserRepository.save, leave Logger.save alone | 3/3 |
| ts-rename | rename a function across a small project | 3/3 |
| ts-callers | rename used across many files | 3/3 |
| chat-tcp | name the three TCP handshake packets | 3/3 |
| chat-chmod | explain chmod 750 per owner/group/other | 3/3 |
| chat-acid | name and explain the four ACID properties | 3/3 |
Eighteen out of eighteen. The same tasks on PrismaQuant pass 100%, so the two quants are indistinguishable on everything I tested. The task that matters most is ts-ambiguous, because that is where a weakened model fails in the dangerous way: it does a global text-replace, clobbers the unrelated Logger.save, and the broken result still compiles. AutoRound got it right every time, with the correct minimal four-file diff, the same as the full-precision-gate PrismaQuant. If the quant had eaten any reasoning, this is the task that would have shown it first. It did not.
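A frozen fact checklist is the simplest of these gates to sketch. The version below is an assumption about the general shape, not agent-bench's actual code: the function names, the matching rule (case-insensitive substring), and the sample answers are all hypothetical.

```python
# Frozen checklist for a chat-acid style task: all four terms must appear.
ACID_TERMS = ("atomicity", "consistency", "isolation", "durability")


def passes_checklist(answer: str, required_terms) -> bool:
    """Deterministic gate: pass only if every required term is present."""
    text = answer.lower()
    return all(term.lower() in text for term in required_terms)


def pass_count(answers, required_terms) -> int:
    """Number of runs that pass the gate, for an N-run tally such as 3/3."""
    return sum(passes_checklist(a, required_terms) for a in answers)


# Hypothetical model outputs for three runs of the same task.
runs = [
    "ACID stands for Atomicity, Consistency, Isolation and Durability.",
    "The four properties are atomicity, consistency, isolation, durability.",
    "Atomicity and consistency, plus isolation.",  # missing one term
]

print(f"{pass_count(runs, ACID_TERMS)}/{len(runs)} runs passed")
```

The property that matters is that the same answer always gets the same verdict, so a change in the pass count can only come from the model, never from the grader.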
The numbers, side by side
Decode throughput, measured with a fixed prompt at temperature 0, prefill time subtracted, median of three runs:
| Quant | bits | DFlash k | decode tok/s |
|---|---|---|---|
| PrismaQuant (prod at the time) | 4.75 | 3 | 61.4 |
| AutoRound int4-mixed | 4.0 | 3 | 69.2 |
| AutoRound int4-mixed | 4.0 | 8 | 67.6 |
Note the k=8 row. The spark-arena leaderboard reports 92.34 tok/s for a leaner int4 AutoRound with eight speculative tokens, so I tested k=8 too. It came in slower than k=3, which matches what I found for PrismaQuant in the same audit: three speculative tokens is the sweet spot on this hardware, and asking for more just wastes verification cycles. The leaderboard’s higher number comes from a different measurement style (short token-generation bursts rather than my prefill-separated 256-token decode) and the leaner non-mixed build. Different ruler, different number.
Why my 61 is not my own earlier 71
There is an honest wrinkle I have to flag, because I published a different number for this exact model before. A few weeks earlier I measured the same PrismaQuant build, same DFlash k=3, at about 71 tok/s, and here I am calling it 61.4. Same weights, same speculative setting, two numbers. Neither is a lie; they are two rulers.
The earlier 71 came from a throughput-style harness (the spark-arena recipe tooling, llama-bench lineage): it generates a short burst of tokens and reports the rate, with prefill and warmup folded into a generous steady-state figure. My 61.4 comes from measure.py, which is deliberately stricter: it sends one fixed prompt, throws away the prefill time entirely by subtracting the time of a max-tokens-1 call, decodes a full 256 tokens so speculative acceptance averages out instead of riding a lucky opening burst, and takes the median of three runs at temperature 0. The stricter ruler reads lower because it refuses to count the parts that flatter the number: no prefill amortization, no short-burst peak, no warm-cache cherry pick.
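The prefill-separated calculation can be sketched as a pure function over wall-clock timings. This is not measure.py itself: the function names are hypothetical, the timings below are synthetic, and counting generated tokens minus one (because the one-token call already produced the first token) is my assumption about the bookkeeping.

```python
import statistics


def decode_rate(total_seconds: float, prefill_seconds: float,
                generated_tokens: int) -> float:
    """Tokens per second for the decode phase only.

    total_seconds:   wall time of the full generation call
    prefill_seconds: wall time of the same prompt capped at one output token
    """
    decode_seconds = total_seconds - prefill_seconds
    if decode_seconds <= 0:
        raise ValueError("full call must take longer than the one-token call")
    # Assumption: the one-token call already covers the first token.
    return (generated_tokens - 1) / decode_seconds


def median_decode_rate(runs, generated_tokens: int = 256) -> float:
    """Median decode rate over (total_seconds, prefill_seconds) pairs."""
    return statistics.median(
        decode_rate(total, prefill, generated_tokens) for total, prefill in runs
    )


# Synthetic timings for three runs: (full call, one-token call), in seconds.
synthetic_runs = [(4.75, 0.5), (5.6, 0.5), (3.9, 0.5)]
rate = median_decode_rate(synthetic_runs)
print(f"median decode rate: {rate:.1f} tok/s")
```

Taking the median rather than the mean keeps one lucky or unlucky run from moving the reported figure, which is what makes the number usable as a regression baseline.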
Which is “right”? Both, for their question. If you want the marketing-friendly peak, the throughput harness is fair. If you want a conservative figure you can hold a regression test against, prefill-separated sustained decode is the honest one. What you must never do is compare a number from one ruler against a number from the other, which is exactly the trap the broken-ruler post is about. That is why every number in this article comes from the same measure.py for both quants: the +12.7% is a like-for-like delta, and it would hold at roughly the same percentage on the looser ruler too (71 becomes about 80). The absolute value moves with the method; the relative verdict does not.
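The like-for-like delta and the projection onto the looser ruler are plain arithmetic on the figures quoted in this article. The sketch assumes the relative gain carries over unchanged between measurement methods, which is the claim being illustrated rather than something the code proves.

```python
# Decode rates from the strict, prefill-separated measurement (tok/s).
PRISMA_K3 = 61.4
AUTOROUND_K3 = 69.2

# Earlier PrismaQuant figure from the looser throughput-style harness (tok/s).
PRISMA_LOOSE = 71.0


def relative_gain(baseline: float, candidate: float) -> float:
    """Fractional speedup of candidate over baseline on the same ruler."""
    return candidate / baseline - 1.0


gain = relative_gain(PRISMA_K3, AUTOROUND_K3)
projected_loose = PRISMA_LOOSE * (1.0 + gain)

print(f"like-for-like gain:        {gain:.1%}")
print(f"projected on looser ruler: {projected_loose:.0f} tok/s")
```

Only the ratio is computed within one ruler; the looser-ruler figure is a projection, not a measurement.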
Pros, cons, and the honest caveats
The case for AutoRound int4-mixed here: 12.7% more speed for free, no quality regression I could measure across coding and factual recall, and a clean rollback because the old quant never leaves the disk. On a bandwidth-bound box, a quant swap is one of the few changes that moves the token rate without touching the prompt or the model family.
The case against, stated plainly: N=3 per task is a strong signal, not a statistical proof, and my task set is TypeScript refactors plus three chat checklists, not a full reasoning suite. The -mixed build also drops vision, which is a non-issue for a text coding role served with --language-model-only but would matter if you needed image input. And “no measurable loss” is bounded by what I measured; a deeper reasoning benchmark could still find a gap I did not probe.
What this changes, and the learning
The actionable change is small: point qwen36-launch.sh at the AutoRound model and set the quantization flag to gptq, since the AutoRound build packs in a GPTQ-compatible format. The decode rate goes from 61 to 69 tok/s and the agent behaves the same.
The learning is bigger than this one swap. The bit count is the first thing you see and the wrong thing to fear. A 4.0-bit model built with a good calibration method and a mixed-precision scheme that protects the routing layers beat a 4.75-bit model on speed and tied it on every quality probe. If I had trusted the bit number, I would have left 13% of my throughput on the table to guard against a quality loss that the measurement says is not there. Measure the thing you are afraid of, not the proxy for it.
Reproduce it
The speed matrix and the quality gate are two scripts in the sovereignty-audit perf folder (phase3.py for throughput, the agent-bench broad run for quality). The harness is agent-bench; the models are Intel/Qwen3.6-35B-A3B-int4-mixed-AutoRound and the PrismaQuant build, both on vLLM. For the bandwidth reasoning behind all of it, see NVFP4 quantization explained and the spark-arena recipes benchmark; for the measurement-honesty lesson that started this whole audit, the broken-ruler story.
Part of the engineering log on running large self-hosted AI on a DGX Spark. Quality measured with agent-bench, the deterministic-gate harness from this same stack.