When Smaller Stops Being Faster: a Quant Ladder on the MI300X
For three posts I ran coding models on a “cheap” Strix Halo miniPC, and one rule held the whole way through: smaller is faster. I put it plainly in the MTP post: since generation is bandwidth-bound, a smaller quant can mean a faster one, because a starved iGPU spends every token waiting on memory and a lighter model has fewer bytes to haul per token. Then I managed to get my hands on a very beefy modern server for a little while, with effectively infinite RAM and an MI300X to play with ;)
So what happens when you go the other direction, from the miniPC to the biggest GPU you can get your hands on? What happens when VRAM, bandwidth, and “what fits” all stop being the bottleneck? How much quality do you actually lose when you quantize a model, and is the speed worth it?
The quality answer is the dull one, so let’s get it out of the way: from BF16 down to Q4_K_M I couldn’t find a difference worth reporting (on either model, whether the benchmark was code, math, instruction-following, or knowledge).
The speed answer is why I bothered. Most of what we learned about quantizing on the miniPC was a bandwidth artifact: a set of lessons that were only the case because that little machine was starving for memory, and they cease to be true on a card with bandwidth to spare. The cleanest case: quantizing harder used to make a model generate faster, and here it does nothing at all, Q8 and Q4 decoding at the same speed. A few of the old rules go further than flat and reverse outright, and chasing down why is most of what follows.
There’s one practical takeaway worth pulling right to the front (which I... have not done here, sorry, I’m a jerk). FP8 is the format you’d actually deploy on an MI300X, and served the naive/obvious way (--quantization fp8 and nothing else) it is slower than not quantizing at all: it loses to plain BF16 at the low concurrency a single self-hosted endpoint actually runs at, and only inches ahead once you’re pushing 64 or more streams. One environment variable, VLLM_ROCM_USE_AITER=1, which I’ve barely seen mentioned anywhere outside AMD’s and vLLM’s own posts and threads, swings it up to 46% the other way: a bigger lever than the entire gap between FP8 and full precision. If you serve models on AMD and that’s news, then you are welcome.
Why a “quant ladder” at all??
A quick refresher: A model’s weights are stored at some numeric precision. Train and release at BF16 (16 bits per weight), then quantize down — Q8 (~8 bits), Q6, Q5, Q4 — to make the file smaller and, on most hardware, faster. The catch is supposed to be quality: fewer bits per weight means a coarser approximation of what the model learned, and at some point it should start writing worse code.
The miniPC posts established one half of this: quality is hardware-agnostic. Run the same GGUF on a Strix Halo iGPU or an A30 and you get the same benchmark score within noise, because the hardware decides how fast, not how smart. This post is the other half: hold the hardware fixed, walk down the quant ladder, and watch what happens to both quality and speed. BF16 sits at the top as ground truth — the precision the model was actually released at — and each rung below trades bits for bytes.
The MI300X is the right place to run this cleanly, for two dumb reasons: 1) I never have to offload and 2) ... I mean someone gave me access to an MI300X! On the miniPC, “what fits” is the entire story. Here, a 35B or 80B model fits in a single GPU’s 192 GiB with room to spare at every quant including BF16, so nothing spills, nothing streams over PCIe, and the only variable that moves is the precision.
The models on the ladder
Two models, both of which have appeared in this series, both Mixture-of-Experts (total parameters large, active-per-token small):
| Model | Total / active | Role | Thinking? |
|---|---|---|---|
| qwen3.6-35B-A3B | 35B / ~3B | the general reasoning model I landed on in this post | yes (on/off switch) |
| qwen3-coder-next (80B-A3B) | 80B / ~3B | the bigger coder model from the same posts | no (instruct-only) |
Picking these two on purpose: one mid-size reasoning model with a thinking mode, one larger coder model without. If the quant curve looks the same on both, the finding generalizes past a single architecture.
How it was served
Two serving stacks, because no single one does everything I needed cleanly on ROCm:
- GGUF K-quants via llama.cpp built for ROCm/gfx942: this is the ladder proper (Q8_0, Q6_K, Q5_K_M, Q4_K_M), and it’s the same quant format I ran on Strix Halo, so the numbers are directly comparable across the whole series. Plain llama-quantize, no imatrix, to keep it honest and reproducible.
- vLLM (AMD’s rocm/vllm-dev container; bare-metal vLLM on ROCm 7.x is its own saga, container-only is the path) for two reference points the GGUF ladder can’t give me: the BF16 anchor and a native FP8 run, FP8 being the format you’d actually deploy in production.
The one wrinkle worth flagging up front: the BF16/FP8 reference points come from vLLM, while the Q8-to-Q4 ladder comes from llama.cpp. So a BF16-vs-Q8 comparison crosses a serving-stack boundary, and I won’t lean on it. The clean apples-to-apples quant curve is Q8 down to Q4, all llama.cpp, all the same harness — and that’s the one I draw conclusions from. BF16 and FP8 are there as sanity anchors, not as rungs.
A note on what I didn’t run: HumanEval+. I left it on the shelf on purpose, because I’ve watched my own HumanEval+ harness score the same model file anywhere from 37% to 92% just by rephrasing the prompt (nested code fences in the prompt template were enough to make a weaker quant emit a fragment instead of a function). A benchmark that swings 55 points on prompt formatting isn’t measuring the model, it’s measuring my harness, so I trust Polyglot and lm_eval here and treat HumanEval+ numbers, mine or anyone’s, with suspicion.
The big risk going in was narrower than any of this, and specific to this box. I’d been running qwen3-coder-next on llama.cpp for months on Strix Halo and the A30, so the model and its GGUFs were known-good; what was new here was the backend, llama.cpp built for ROCm/gfx942. qwen3-coder-next isn’t a plain transformer, it runs on the Qwen3-Next architecture with a hybrid/linear-attention design whose unusual kernels have needed their own fixes in llama.cpp before, and the Vulkan and CUDA paths I already trusted were no guarantee the HIP/ROCm path would compute those layers correctly. It’s the kind of model that can load up looking fine and still quietly emit garbage. So the de-risk gate was: build for gfx942, load, run one Polyglot exercise to prove the harness could still tell a right answer from a wrong one before committing to multi-hour runs. It cleared, which I don’t take for granted when an odd architecture meets a backend it hasn’t been road-tested on.
Quality: the ladder is flat
This is the half I expected to be dull, and it obliged. That quantizing down to Q4 barely touches quality is close to received wisdom by now, so I walked the ladder to confirm it on these two models, just in case of a surprise.
Here’s the coding result — Aider Polyglot, 225 multi-language exercises, pass@2 (the model gets a second shot after seeing test failures, which is how you’d actually use it):
| Quant | qwen3.6 thinking-on | qwen3.6 thinking-off | qwen3-coder-next |
|---|---|---|---|
| BF16 (vLLM ref) | 59.6 | 54.2 | – |
| Q8_0 | 65.3 | 56.0 | 57.8 |
| Q6_K | 61.3 | 57.3 | 58.2 |
| Q5_K_M | 65.3 | 52.4 | 58.7 |
| Q4_K_M | 61.8 | 50.2 | 56.9 |
| FP8 (vLLM ref) | 59.1 | – | – |
Take a long look at the numbers and let them sink in. The qwen3-coder-next ladder spans 1.8 points top to bottom — Q5 (the 5-bit quant!) scores higher than Q8. qwen3.6 thinking-on bounces between 61 and 65 with no downward trend; Q8 and Q5 tie at the top, Q6 dips below both. These aren’t degradation curves, they’re noise. On a 225-exercise benchmark, the run-to-run wobble is bigger than anything the quantization is doing. Down to Q4_K_M — a 4-bit approximation of the weights — there is no measurable coding-quality cost on either model.
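A rough way to sanity-check the "noise" reading is a binomial standard error on the pass rate. This is a back-of-envelope sketch, not part of the original runs: it assumes the 225 exercises behave like independent pass/fail trials, which is a simplification, and the function name is made up for illustration.

```python
import math

def pass_rate_std_error(pass_rate_pct: float, n_exercises: int) -> float:
    """Binomial standard error of a pass rate, in percentage points.

    Assumption: each exercise is an independent pass/fail trial.
    """
    p = pass_rate_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_exercises)

N_EXERCISES = 225  # Aider Polyglot size, from the text

# qwen3-coder-next scores copied from the table above
coder_next = {"Q8_0": 57.8, "Q6_K": 58.2, "Q5_K_M": 58.7, "Q4_K_M": 56.9}

spread = max(coder_next.values()) - min(coder_next.values())
std_error = pass_rate_std_error(coder_next["Q8_0"], N_EXERCISES)

print(f"ladder spread, best to worst: {spread:.1f} points")
print(f"one-run standard error at Q8_0: {std_error:.1f} points")
```

Under that assumption a single run carries a standard error of a bit over 3 points, so a 1.8-point spread across the whole ladder sits comfortably inside one run's uncertainty. The difference between two runs is noisier still, by a factor of about the square root of two.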
And it’s not just coding. The whole point of running lm_eval on top was to check quality from angles Polyglot can’t see — math, instruction-following, broad knowledge:
| | gsm8k | ifeval | mmlu_pro |
|---|---|---|---|
| qwen3.6, Q8 to Q4 | 87.0 to 87.1% | 83.7 to 83.0% | 77.8 to 77.4% |
| coder-next, Q8 to Q4 | 89.6 to 87.8% | 79.9 to 81.1% | 76.2 to 76.0% |
Every delta from Q8 to Q4 is under two points, and four of the six are under one. (One thing I watched for, since these are reasoning-capable models: if you grade a thinking model wrong, its chain-of-thought eats the answer field and gsm8k craters to single digits, a config bug that looks like a quality cliff. gsm8k landing at 87–90% is how I know the reasoning was handled correctly and these scores are real, not artifacts of a misgraded run.) Across four benchmarks and two models, going from 16 bits down to 4 costs you essentially nothing I can measure. That’s the result everyone hopes quantization gives and rarely gets to state this cleanly, so I’ll take the clean win before the speed section complicates things.
A notion that didn’t pan out: the thinking gap doesn’t widen
qwen3.6 has a thinking on/off switch, and I ran both modes partly to chase a notion: maybe chain-of-thought compensates for quantization damage, so the gap between thinking-on and thinking-off should widen as you quantize harder (thinking would have more to fix). The first two data points teased it — the gap looked like it grew from ~9pp at Q8 to ~12pp at Q4.
Then I filled in the middle rungs, and the pattern evaporated. The thinking-on-minus-off gap across BF16/Q8/Q6/Q5/Q4 goes 5.4 / 9.3 / 4.0 / 12.9 / 11.6 points, a random walk between 4 and 13. The apparent widening was just two cherry-picked endpoints. Thinking doesn’t measurably rescue a more-quantized model, boooo. Oh well, worth a try!
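The gap series is plain subtraction on the Polyglot table, and it is easy to recompute. A minimal sketch, with the scores copied from the table above and variable names chosen for illustration:

```python
# Polyglot pass@2 scores for qwen3.6, copied from the table above
thinking_on = {"BF16": 59.6, "Q8_0": 65.3, "Q6_K": 61.3, "Q5_K_M": 65.3, "Q4_K_M": 61.8}
thinking_off = {"BF16": 54.2, "Q8_0": 56.0, "Q6_K": 57.3, "Q5_K_M": 52.4, "Q4_K_M": 50.2}

# Thinking-on minus thinking-off at each rung, in percentage points
gaps = {
    quant: round(thinking_on[quant] - thinking_off[quant], 1)
    for quant in thinking_on
}

# If thinking compensated for quantization, the gap would grow at every step down
values = list(gaps.values())
widens_every_step = all(a < b for a, b in zip(values, values[1:]))

for quant, gap in gaps.items():
    print(f"{quant:7s} gap = {gap:4.1f} points")
print(f"gap widens at every rung: {widens_every_step}")
```

Looking only at Q8_0 and Q4_K_M gives the teasing 9.3 to 11.6 step; the Q6_K value in between is what breaks the pattern.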
Speed: where the miniPC intuition breaks
This is the part that surprised me a little. Here’s llama-bench on qwen3.6 across the ladder — prefill (processing your prompt) and decode (generating the response), tokens/sec, higher is better:
| Quant | File size | Prefill tok/s | Decode tok/s |
|---|---|---|---|
| F16 | 66.2 GB | 1,303 | 139 |
| Q8_0 | 35.2 GB | 5,473 | 154 |
| Q6_K | 27.2 GB | 4,374 | 150 |
| Q5_K_M | 23.6 GB | 5,446 | 151 |
| Q4_K_M | 20.2 GB | 5,555 | 153 |
The decode column is the one to watch, and it refuses to move. Halving the bytes per weight from Q8 to Q4 takes generation from 154 tok/s to 153, a gap the run-to-run noise swallows whole. Even full F16 only sags to 139. For everything the ladder does to file size, decode speed sits still.
Strix Halo behaved nothing like this, and this is exactly the rule I opened with coming due. There, generation is bandwidth-bound, so dropping down the quant ladder genuinely speeds it up: in the MTP post this same qwen3.6 ran about 50 tok/s at Q8 and 65 at Q4 with MTP off, a clean ~30% win for nothing but going smaller. The two boxes side by side:
| qwen3.6 decode, MTP off | Q8_0 | Q4_K_M | Q8 to Q4 |
|---|---|---|---|
| Strix Halo (gfx1151, Vulkan) | ~50 tok/s | ~65 tok/s | +30% |
| MI300X (gfx942, llama.cpp ROCm) | 154 tok/s | 153 tok/s | ~0% |
The same change that bought 30% on the miniPC buys nothing here. The MI300X carries so much HBM3 bandwidth that decode never waits on memory, even at Q8, and trimming the model further only frees up headroom the GPU was never short on. The “smaller model generates faster” logic I leaned on for three straight posts was really a fact about starved hardware all along, and it doesn’t survive contact with a card sitting this far below its bandwidth limit.
Prefill tells the other half of it. F16 prefill collapses, 1,303 tok/s against roughly 5,500 for Q8_0, Q5_K_M, and Q4_K_M, so moving off full precision is a real ~4x win there (prefill leans on compute and bandwidth in a way decode doesn’t). Once you’re actually on the ladder, though, Q8 to Q4 prefill flattens out the same way decode does, with one exception at Q6_K covered below. The whole jump lives between F16 and Q8; everything below Q8 is rounding error.
The 80B coder-next shows the identical shape, just slower in absolute terms — F16 prefill/decode of 717/106, Q4 of 3,527/121. Same story: decode flat across quants, F16 prefill collapses, file size drops 3.3x (148 to 45 GiB) for free.
One weird rung: Q6_K is slower than Q8_0
Look again at the qwen3.6 table: Q6_K prefills at 4,374 tok/s, behind the larger Q8_0 at 5,473, even though its file is 8 GB smaller. coder-next repeats the trick (Q6_K 2,840 against Q8_0 3,425), so it isn’t a one-off. Q8_0 unpacks almost for free, a scale factor per block and not much else, while the K-quants carry a fussier block structure that costs real compute to dequantize on the way to the matmul. When bandwidth is the bottleneck, as on the miniPC, the smaller file still wins, because the bytes you save outweigh the compute you spend getting at them. Lift the bottleneck and the bill for that extra dequant work comes due, which is how the smaller quant ends up the slower one.
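To make "a scale factor per block and not much else" concrete, here is a simplified pure-Python sketch of a Q8_0-style block: 32 weights share one scale, and unpacking is a single multiply per weight. It is an illustration under stated assumptions, not llama.cpp's code: the real format stores the scale as fp16 (kept as a Python float here), the function names are made up, and the K-quant layouts, which are more involved, are not shown.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32

def quantize_q8_0_block(weights):
    """Return (scale, int8-range values) for one block of 32 floats."""
    assert len(weights) == BLOCK_SIZE
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0 if amax > 0 else 0.0
    quants = [round(w / scale) if scale else 0 for w in weights]
    return scale, quants

def dequantize_q8_0_block(scale, quants):
    """The whole unpack step: one multiply per weight."""
    return [scale * q for q in quants]

def bits_per_weight(block_size=BLOCK_SIZE, quant_bits=8, scale_bits=16):
    """Storage cost: 8-bit values plus one 16-bit scale shared by the block."""
    return (block_size * quant_bits + scale_bits) / block_size

# A deterministic toy block with values between -0.32 and 0.31
block = [((i * 37) % 64 - 32) / 100.0 for i in range(BLOCK_SIZE)]

scale, quants = quantize_q8_0_block(block)
restored = dequantize_q8_0_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(f"bits per weight: {bits_per_weight()}")
print(f"scale: {scale:.5f}, max round-trip error: {max_error:.5f}")
```

The point of the sketch is the shape of `dequantize_q8_0_block`: nothing to unpack, no nested scales, just a multiply. Anything with packed sub-byte values and multiple levels of scales has to do more work per weight before the matmul can start.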
So when does quantizing a big GPU ever pay off?
If decode speed is flat and quality is flat, the obvious reaction is “so don’t bother quantizing on a card like this.” That’s the wrong lesson, and it’s worth going into why, because the benchmark above is rigged against quant in a way that’s easy to miss.
My llama-bench runs are batch-1: one request, one stream, nobody else on the GPU. That’s the latency case, and it’s exactly the regime where quant on this hardware does nothing. Here’s the math: at batch-1, generating a token means reading the active weights once, so the speed ceiling is roughly (memory bandwidth) / (bytes of active weights). For a 3B-active MoE at Q8, that’s about 3 GB of weights against the MI300X’s 5,325 GB/s of HBM3: a ceiling near 1,775 tok/s. I measured 154. We’re under a tenth of the way to the bandwidth wall, which means single-stream decode here is bound by overhead (kernel launches, MoE routing, attention), not by hauling weights. Halve the weight bytes with a heavier quant and you’ve cut a cost that wasn’t the bottleneck, so nothing moves.
The miniPC lives on the far side of that line. Its ~256 GB/s of LPDDR5, roughly 21x less bandwidth than the MI300X, drops the same Q8 ceiling to about 85 tok/s, and in practice it generates around 50. That machine is pinned against the wall, so shaving bytes per token nearly doubles its ceiling and you feel every byte of it.
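The ceiling arithmetic for both boxes fits in a few lines. A minimal sketch using only numbers quoted in the text; the 3 GB figure is the text's own approximation for ~3B active parameters at Q8, and the function name is illustrative:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on batch-1 decode: each token reads the active weights once."""
    return bandwidth_gb_s / active_weight_gb

ACTIVE_WEIGHTS_Q8_GB = 3.0  # ~3B active params at roughly 1 byte each

boxes = {
    # name: (memory bandwidth in GB/s, measured Q8 decode in tok/s)
    "MI300X": (5325.0, 154.0),
    "Strix Halo": (256.0, 50.0),
}

for name, (bandwidth, measured) in boxes.items():
    ceiling = decode_ceiling_tok_s(bandwidth, ACTIVE_WEIGHTS_Q8_GB)
    share = measured / ceiling
    print(f"{name:10s} ceiling ~{ceiling:5.0f} tok/s, "
          f"measured {measured:.0f} ({share:.0%} of ceiling)")
```

The share of the ceiling actually used is the number to look at: under a tenth on the MI300X, well over half on the miniPC. A box running close to its ceiling speeds up when you cut bytes; one running far below it does not.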
So back to the point: when does it pay off on a big GPU? Three real cases, none of which my batch-1 latency test can see:
- To fit on fewer GPUs, ideally one. This is about capacity, not speed, and for a big enough model it can be the biggest practical win on the list, even if neither of my two ever triggers it: qwen3.6 is 66 GB at F16 and coder-next is 148 GiB, so both fit a single 192 GB card at full precision and I never had to quantize either just to make it fit. It bites a tier up, on models too large for one card, where halving the bytes per weight is the difference between needing two cards and one. Collapsing a model onto a single GPU sidesteps tensor-parallel entirely, no cross-GPU comms on the critical path, which is usually simpler and faster per request than splitting it across several, though a future post will hopefully demonstrate that with real numbers (the multi-GPU story on this box could be its own writeup). For the models in this post it’s moot; once you’re past what a single card holds, quantizing to fit is often the highest-value move available.
- To free memory for KV cache, which is long context and high concurrency. Weights and the KV cache share the same 192 GB. A 35B model barely dents it, but a 120B-plus model serving many users at long context becomes KV-bound, not weight-bound, and every gigabyte you don’t spend on weights is a gigabyte you can spend on more concurrent sequences or more context. Single-stream latency stays flat, but aggregate throughput climbs because you’re fitting more work on the card at once. One sharp caveat, though, and it’s the same bandwidth lesson again: this means quantizing the weights to make room for an f16 KV cache, not quantizing the KV cache itself. I checked that separately, and on this hardware quantizing the cache is its own “smaller is slower” pitfall: a q8_0 KV cache costs you ~27% of decode at 256K context and a q4_0 cache ~59%, because attention has to dequantize the cache on every step, and the bandwidth saving that would pay for that work is worth nothing on a card with bandwidth to spare. Quantize KV to fit something that otherwise wouldn’t; never quantize it expecting speed.
- For throughput at scale, a native low-precision format like FP8, not a K-quant. As you batch up requests, decode stops being overhead-bound and becomes compute-bound, the matmuls get big enough to saturate the cores. On paper FP8 wins there on raw math: the MI300X does FP8 matrix multiply at 2,615 TFLOPS against 1,307 for BF16, a clean 2x. That’s a compute win with nothing to do with memory, and GGUF K-quants can’t touch it because they dequantize back to fp16 before the matmul. So I measured it, and there’s a real gotcha hiding in the word “FP8,” see the next section.
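The first two cases are capacity arithmetic, which can be sketched directly. This is a rough budget only: it counts weights and nothing else (no activations, no runtime overhead), treats the GB and GiB figures in the text as interchangeable, and the 300 GB model is hypothetical, added to show where the one-card-versus-two line falls.

```python
import math

CARD_MEMORY_GB = 192.0  # single MI300X, from the text

def cards_needed(weights_gb: float, card_gb: float = CARD_MEMORY_GB) -> int:
    """Minimum cards just to hold the weights (ignores KV cache and activations)."""
    return math.ceil(weights_gb / card_gb)

def kv_headroom_gb(weights_gb: float, cards: int, card_gb: float = CARD_MEMORY_GB) -> float:
    """Memory left for KV cache once the weights are resident."""
    return cards * card_gb - weights_gb

models = {
    # sizes for the first four rows come from the text
    "qwen3.6 F16": 66.2,
    "qwen3.6 Q4_K_M": 20.2,
    "coder-next F16": 148.0,
    "coder-next Q4_K_M": 45.0,
    # hypothetical: a model too big for one card at 16-bit
    "hypothetical model, 16-bit": 300.0,
    "hypothetical model, half the bytes": 150.0,
}

for name, size_gb in models.items():
    n_cards = cards_needed(size_gb)
    headroom = kv_headroom_gb(size_gb, n_cards)
    print(f"{name:36s} {n_cards} card(s), {headroom:6.1f} GB left for KV cache")
```

Both models in this post land on one card at every precision, so only the headroom column moves. The hypothetical rows show the case the first bullet describes, where halving the bytes is what drops the card count from two to one.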
Measured: FP8 wastes most of its advantage by default
I ran a concurrency sweep instead of asserting claim #3: the same qwen3.6-35B, a fixed 512-in/128-out workload, closed-loop at 1/8/32/64/128 concurrent requests. BF16 and FP8 on vLLM, Q4_K_M on llama.cpp. The first FP8 column is the obvious invocation, --quantization fp8. The second adds one environment variable, VLLM_ROCM_USE_AITER=1, which turns on AMD’s optimized kernels. Output tokens/sec, higher is better:
| Concurrency | BF16 | FP8 (default) | FP8 + AITER | Q4_K_M (llama.cpp) |
|---|---|---|---|---|
| 1 | 130 | 104 | 121 | 114 |
| 8 | 563 | 515 | 666 | 244 (28 reqs failed) |
| 32 | 1,491 | 1,493 | 1,932 | 270 (39 failed) |
| 64 | 1,997 | 2,135 | 2,917 | 255 (36 failed) |
| 128 | 2,621 | 2,792 | 3,637 | 277 (7 failed) |
Default FP8 is a trap on this hardware. The obvious invocation, --quantization fp8 and nothing else, is slower than just running BF16 across the low and middle of the range a single self-hosted endpoint actually runs at: 20% slower single-stream, 9% slower at eight concurrent, dead even at 32, and a measly 7% ahead even at 128. That’s a thoroughly underwhelming showing for a format with twice BF16’s theoretical FLOPS. It matches open reports of FP8 underperforming BF16 on MI300X (vLLM #31475): the stock FP8 path on ROCm leaves the win on the floor.
One flag recovers most of it. Flip on AMD’s AITER kernels (VLLM_ROCM_USE_AITER=1) and FP8 goes from “why did I bother” to 18% faster than BF16 at eight concurrent, 30% at 32, and 46% at 64. Same weights, same precision, same vLLM, one env var, up to a 46% throughput difference. If you run an FP8 endpoint on an MI300X without AITER, you are leaving roughly a quarter of your serving capacity unused and would have been faster staying on BF16 at low load. (I also tried a pre-quantized static-scale FP8 checkpoint, Qwen’s official one, in case dynamic scaling was the culprit. It made no difference. AITER was the whole story.)
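The percentages quoted here all come from dividing columns of the sweep table. A minimal sketch that recomputes them, with the throughput numbers copied from the table above and names chosen for illustration:

```python
# Output tokens/sec from the sweep table above, keyed by concurrency
bf16 = {1: 130, 8: 563, 32: 1491, 64: 1997, 128: 2621}
fp8_default = {1: 104, 8: 515, 32: 1493, 64: 2135, 128: 2792}
fp8_aiter = {1: 121, 8: 666, 32: 1932, 64: 2917, 128: 3637}

def pct_vs(candidate: float, baseline: float) -> float:
    """Percent throughput change of candidate relative to baseline."""
    return 100.0 * (candidate / baseline - 1.0)

print("conc  default vs BF16  AITER vs BF16  AITER vs default")
for conc in bf16:
    print(
        f"{conc:4d}  "
        f"{pct_vs(fp8_default[conc], bf16[conc]):+14.1f}%  "
        f"{pct_vs(fp8_aiter[conc], bf16[conc]):+12.1f}%  "
        f"{pct_vs(fp8_aiter[conc], fp8_default[conc]):+15.1f}%"
    )
```

Two things the full grid shows that the headline numbers skip: at a single stream even AITER FP8 still trails BF16 (121 against 130), and the AITER lead over BF16 peaks at 64 concurrent and narrows a little at 128.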
So what is the flag actually doing? AITER is AMD’s pile of hand-tuned inference kernels for ROCm, a mix of assembly, Composable Kernel, and Triton; the ones that matter here are its FP8 GEMM and fused-MoE kernels. Stock vLLM on ROCm runs FP8 through generic kernels that don’t squeeze the MI300X’s FP8 throughput, and AITER swaps in kernels actually tuned for this silicon, for the matmuls and, on an MoE, the expert routing. It’s a kernel change, not a precision change, so it’s free on the quality side: you’re running the identical FP8 weights, just through math that fits the hardware (the same FP8 that lands at 59.1 Polyglot, within noise of BF16).
The catch, before you build on it: AITER is a moving target, not a set-and-forget switch. Coverage is uneven, the tuned kernels exist for some models and ops and not others (AMD’s own launch numbers were scoped to DeepSeek), and which ones are fast shifts version to version with the ROCm/vLLM container; and the FP8 win in particular only lands on CDNA3-class cards like this one, which have the native FP8 units to begin with. The honest read is “free speed for this model, on this container, today,” which is why I measured it on qwen3.6 instead of assuming the flag behaves the same everywhere.
But even tuned, it’s not the 2x in my use case. AITER FP8 tops out around +46%, not the +100% the FLOPS spec implies. The reason is the model: a 35B-A3B MoE only fires ~3B parameters per token, so the FP8 matmul speedup applies to a sliver of the work while routing and expert-selection overhead, which FP8 does nothing for, takes a bigger share of each token. A dense model would likely see more of the 2x. So the honest ceiling for this class of model is “a third to a half faster, with the right kernels,” which is still well worth having.
And llama.cpp is a single-stream tool, not a serving stack. At one request it’s competitive (114 tok/s, ahead of default FP8). The instant you add concurrency it collapses: throughput flatlines around 270 tok/s no matter the load (vLLM scales past 3,600 with AITER, more than 10x), it starts dropping requests, and tail latency falls apart, mean time-to-first-token at 128 concurrent was over twelve seconds. GGUF on llama.cpp is the right tool for the single-user, fits-on-my-box case this whole series is about; put it under a production serving load and you want vLLM. (The failed-request counts wobble enough that I read them as “it fell over,” not a precise capacity number, but the throughput plateau and the latency blowup are unambiguous. This makes me sad, because honestly I kind of hate using vLLM hah.)
So claim #3 holds, with a big asterisk: a native format does win the compute regime where a K-quant can’t, but only if you turn on the kernels that make it work. The default doesn’t, and the gap between the default and the tuned path is bigger than the gap between FP8 and BF16 in the first place.
The thing that genuinely does not help, on any of these axes, is grabbing a smaller GGUF K-quant and expecting single-stream generation to speed up. It won’t, and as the Q6_K rung showed, it can go backwards.
So what do you actually pick
If you’re running one of these models on an MI300X (or anything with comparable HBM bandwidth), the takeaways are clean:
- Quantize to Q4_K_M without quality guilt. Two models, four benchmarks, BF16 to Q4 is flat. You’re not trading smarts for size in any way I could measure.
- Quantize for capacity, not latency. On this hardware the reason to shrink a model is to fit it on fewer GPUs (ideally one, dodging tensor-parallel) or to free memory for KV cache, not to make a single stream generate faster. Q4 gets you a 3.3x smaller model and ~4x faster prefill versus F16, but Q8-to-Q4 will not move your decode speed, because decode was never bandwidth-bound here. If single-stream tokens/sec is all you care about, Q8 and Q4 are a wash; pick on footprint.
- Don’t assume “smaller quant = faster” transfers from consumer hardware. It’s a bandwidth-starvation effect, and this box isn’t starved. Q6_K being slower than Q8_0 here is the canary: dequant compute can cost more than the bytes you saved.
- FP8 for footprint always, for throughput only with AITER. FP8 lands within noise of BF16 on quality (59.1 vs 59.6 Polyglot) and halves the model’s memory, reason enough on its own. For speed, the issue is the kernels: stock --quantization fp8 is slower than BF16 below ~32 concurrent, but VLLM_ROCM_USE_AITER=1 flips it to 30–46% faster under load. If you serve FP8 on MI300X, turn AITER on, it’s the difference between FP8 being a downgrade and a real win. And if you’re low-concurrency, BF16 is still faster single-stream, so don’t reach for FP8 expecting latency.
The throughline for the series holds and extends: quality is hardware-agnostic and quant-agnostic down to Q4. What’s hardware-specific is the speed you get for quantizing — and on a GPU that isn’t starved for bandwidth, that speed is mostly already on the table at Q8. The interesting question was never “how smart is the smaller model” (just as smart). It’s “what does the smaller model buy you on this box” — and the answer flips depending on whether the box was ever hungry for bandwidth in the first place.
Caveats and fine print
- TP=1 throughout. Everything here ran on a single MI300X, which is all these models need (they fit one card at every precision). Multi-GPU on this box is its own saga — 2- and 4-way tensor parallel do work once you get the boot flags and RCCL shared memory sorted, and the full multi-GPU story is a post of its own. Nothing here needed more than one GPU.
- The BF16/FP8 anchors are vLLM, the ladder is llama.cpp. As noted up top, I only draw the quant-curve conclusions from the Q8 to Q4 llama.cpp runs, which are internally consistent. Treat the BF16 and FP8 points as reference, not as rungs on the same ruler.
- Polyglot noise is real. 225 exercises is enough to rank models a tier apart but not enough to resolve sub-2-point quant differences, which is exactly why I’m calling the ladder “flat” rather than reading its wiggles as signal. The lm_eval breadth (every Q8-to-Q4 delta under 2pp, most under 1pp) is what makes me confident it’s genuinely flat and not just Polyglot being coarse.
- This is two MoE models. Both are A3B (~3B active). A dense model, or a much larger active-parameter count, could push decode back toward bandwidth-bound even on this hardware. If you’re on different architecture, measure your own decode column before trusting the “flat” claim.