In the MTP post I got Multi-Token Prediction working on the Strix Halo miniPC, squeezed an 18-26% speedup out of qwen3.6 once I stopped using the default that made it 3x slower, and then handed out some advice. The win was probably specific to bandwidth-starved hardware, I said, so if you’re on a 3090 or an A100, don’t expect these numbers. I also said that if you measured MTP on a different GPU, you should publish what you got. But here’s the thing… I OWN a 3090. Not doing that myself was just lazy! So I took my own advice and ran it on two more cards - an MI300X datacenter GPU and an RTX 3090. The 3090 turned in +41%, the biggest speedup in the whole series, on the exact card I’d told you to skip. Err… My bad?

The MI300X behaved as expected: MTP made qwen3.6 slower at every setting. And the part that should have tipped me off sooner is that all three machines - Strix, 3090, MI300X - accepted the draft head’s guesses about 90% of the time. Same trick, same acceptance rate, and the outcome ran from a 41% win to a loss at every setting based on nothing but which GPU it ran on. My mistake was reasoning about memory bandwidth, when bandwidth is only a proxy for the variable that actually decides this - a variable that’s been sitting in the original speculative-decoding paper since 2023, which I’d have known if I’d googled before I started giving advice ;)

A brief refresher

Speculative decoding speeds up generation by having a small, fast predictor guess several tokens ahead, then letting the full model verify the whole batch in one forward pass. Every accepted guess is a token you got nearly for free. MTP is the self-speculative version: the model ships with a lightweight extra head trained to predict a few tokens out, so the draft and the verify come from the same model. The knob that matters is --spec-draft-n-max, how many tokens the head drafts before the model checks its work. On the miniPC the sweet spot was 2-4; the default of 16 was a disaster. (All of that is in the first post; this one assumes it.)

The catch with spec decoding is that every rejected guess is wasted compute, both the draft’s and the verify’s. The whole thing only pays off if the guesses are accepted often enough that the saved steps outweigh the wasted work. So you check the acceptance rate. Which is where these results stop making sense.
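
To make the draft-then-verify loop concrete, here is a minimal toy sketch in Python. Nothing in it is llama.cpp code: `target_next` and `draft_next` are made-up stand-ins for the full model and the MTP head, and the "verify pass" is simulated one position at a time, where a real engine would score the whole draft in a single forward pass. What it does show is why greedy speculative decoding is exact, and how acceptance gets counted.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (hypothetical)


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the full model: a deterministic function of the context."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the draft head: right most of the time, wrong on every 7th position."""
    guess = target_next(ctx)
    return (guess + 1) % VOCAB if len(ctx) % 7 == 0 else guess


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding with the target only."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, n_max: int) -> Tuple[List[int], int, int]:
    """Greedy speculative decoding; returns (tokens, drafted, accepted)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    drafted = accepted = 0
    while len(out) < goal:
        # Draft phase: guess n_max tokens ahead, each conditioned on the previous guesses.
        draft: List[int] = []
        for _ in range(n_max):
            draft.append(draft_next(out + draft))
        drafted += len(draft)
        # Verify phase (simulated per position): keep the longest matching prefix.
        for tok in draft:
            correct = target_next(out)
            if tok == correct:
                out.append(tok)
                accepted += 1
            else:
                out.append(correct)  # the target's token replaces the first miss
                break
        else:
            out.append(target_next(out))  # every guess accepted: one extra token from the verify
    return out[:goal], drafted, accepted


if __name__ == "__main__":
    tokens, drafted, accepted = speculative_decode([1, 2, 3], 64, n_max=4)
    print(tokens == greedy_decode([1, 2, 3], 64), f"acceptance={accepted / drafted:.2f}")
```

Every token that lands in the output is the one the target would have picked anyway, which is why the output matches plain greedy decoding. The rejected guesses (and everything drafted after a miss) are the wasted work.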

The MI300X: slower everywhere

Same model (qwen3.6-35B-A3B), same llama.cpp, single MI300X, decode tokens/sec, higher is better. Methodology: a fixed ~80-token prompt, 256 tokens generated, 3 reps averaged with a warm-up discarded, via llama-server’s /completion endpoint. I swept --spec-draft-n-max from 1 to 16 against the MTP-off baseline, for both a Q8_0 and a Q4_K_M trunk:

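| n_max | Q8_0 tok/s | Q8_0 vs off | Q4_K_M tok/s | Q4_K_M vs off |
|---|---|---|---|---|
| off | 145.4 | - | 143.1 | - |
| 1 | 124.8 | -14% | 135.1 | -6% |
| 2 | 125.3 | -14% | 130.7 | -9% |
| 3 | 139.8 | -4% | 126.3 | -12% |
| 4 | 133.2 | -8% | 126.6 | -11% |
| 8 | 124.1 | -15% | 118.7 | -17% |
| 16 (default) | 115.7 | -20% | 106.5 | -26% |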

No wins there. The least-bad config (Q8_0 at n_max=3) still loses 4%, and that one had enough run-to-run variance that I’d call it a wash at best. The default n_max=16 is a 20-26% regression, reproducing the same “the default is a trap” finding from the miniPC, just without any good configuration on the other side of it to redeem it. On this hardware there is no setting I’ve found where MTP comes out ahead.
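
For anyone who wants to repeat the measurement, the loop described above (one discarded warm-up, then a few timed reps averaged) can be sketched roughly like this. This is not my actual harness. The request field `n_predict` and the response field `timings.predicted_per_second` are how I understand llama-server’s /completion endpoint to behave, so check them against your build; the URL, the prompt, and the `temperature: 0` setting are assumptions for illustration.

```python
import json
import statistics
import urllib.request
from typing import Callable, Dict


def post_completion(url: str, payload: Dict) -> Dict:
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def decode_tok_per_sec(
    post: Callable[[str, Dict], Dict],
    url: str,
    prompt: str,
    n_predict: int = 256,
    reps: int = 3,
) -> float:
    """Mean decode tokens/sec over `reps` runs, after one discarded warm-up run."""
    payload = {"prompt": prompt, "n_predict": n_predict, "temperature": 0.0}
    post(url, payload)  # warm-up: result thrown away
    rates = [post(url, payload)["timings"]["predicted_per_second"] for _ in range(reps)]
    return statistics.mean(rates)


if __name__ == "__main__":
    # Hypothetical local server; restart it with each n_max setting and re-run.
    url = "http://127.0.0.1:8080/completion"
    print(decode_tok_per_sec(post_completion, url, "Explain speculative decoding step by step."))
```

Passing the `post` function in as an argument is only there so the timing logic can be exercised without a live server.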

The 3090: where Damen was wrong

And now for something completely different. RTX 3090, 24 GB, CUDA build of the same llama.cpp, same qwen3.6-35B-A3B at Q4_K_M with the matching MTP head. Greedy decode so MTP-on produces byte-identical output to MTP-off (with greedy, speculative decoding is exact, so this is a controlled same-workload speed test), 256 tokens, three reps each. Decode tokens/sec:

| n_max | tok/s | vs off |
|---|---|---|
| off | 167 | - |
| 1 | 217 | +30% |
| 2 | 236 | +41% |
| 3 | 233 | +39% |
| 4 | 234 | +40% |
| 8 | 170 | +1% |
| 16 | 128 | -24% |

That is a real, large win, and the reason I’d bet against it is worth digging into a little because past me wasn’t REALLY just being lazy. I’d cited an early community writeup that swept 19 spec-decode configs on a 3090 and found none faster than baseline, and that’s what I leaned on when I told you to skip it. In fairness to “past Damen”, that writeup wasn’t even testing the same mechanism: it swept a separate draft model plus ngram methods, not the self-speculative MTP head this post is about, and its own author was careful to call the negative result engine-specific to llama.cpp’s draft path - not a property of the card or the model - having seen vLLM’s MTP come out ahead. Run through the draft-mtp path, MTP on this 3090 turns out to be the best result in the series. I’ll happily take being wrong in this direction.

The peak is +41% at n_max=2, with a comfortable plateau across 2 to 4. The acceptance rate at the peak was 90.6% (164 of 181 drafted tokens accepted, measured server-side; the throughput numbers above come from the greedy llama-cli runs, so acceptance and speed are from separate invocations of the same config rather than one combined measurement). The off baseline (167 tok/s) matches an independent llama-bench run to within noise, so the speedup is genuine and not a harness artifact. Two things jump out against the MI300X. The win is bigger than the miniPC’s +18-26%, on a card with almost four times the bandwidth, which is backwards from any simple bandwidth story. And the n_max=16 default tanks it here too, down 24%.

The part that didn’t add up (until it did)

The strange part was the acceptance. It was excellent everywhere: 82-93% on the MI300X, 90.6% on the 3090, similarly high on the Strix. And that same near-90% bought a 41% speedup on the 3090, +18-26% on the miniPC, and a slowdown on the MI300X. So “is the speculation working?” (yes, accepted everywhere) turns out to be a completely different question from “is it helping?” Acceptance tells you the head is good. It says nothing about whether the saved steps are worth the added compute, and that depends entirely on the hardware.

Why: bandwidth was a proxy

This is the part I got wrong, and it’s textbook, which somehow makes it worse. I went looking in memory bandwidth, and there is a bandwidth story here, but it’s a stand-in for something more precise that’s been written down since the technique was invented.

Speculative decoding has a closed-form speedup model, from Leviathan et al. (2023), the paper that introduced it. The expected wall-clock speedup is:

       1 - α^(γ+1)
S = ──────────────────
     (1 - α)(γc + 1)

Three knobs. α is the acceptance rate - how often a drafted token survives verification. γ is how many tokens you draft per step (our --spec-draft-n-max). And c is the one I had been ignoring: the cost ratio, how long one draft pass takes relative to one verify pass. The catch is that c is platform-dependent: a property of the model and the hardware it runs on together, not of the model alone.

That fact is… kind of the whole thing here. Acceptance only sets the numerator - how many tokens a good round buys you; c sets the denominator, what that round costs. I’d been obsessing over the numerator and ignoring the denominator the hardware quietly controls.
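
The formula is small enough to write down as a function. A minimal sketch follows; the function name and the two example c values are mine, picked only to show how the same acceptance lands on either side of 1.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Leviathan et al. expected wall-clock speedup.

    alpha: acceptance rate, 0 <= alpha < 1
    gamma: tokens drafted per round (n_max)
    c:     draft-pass time divided by verify-pass time
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)  # what a round buys
    cost_per_round = gamma * c + 1                               # what a round costs
    return tokens_per_round / cost_per_round


if __name__ == "__main__":
    # Same acceptance and depth, two hypothetical cost ratios.
    print(f"{expected_speedup(0.9, 2, 0.5):.2f}")  # 1.36 (cheap draft: a win)
    print(f"{expected_speedup(0.9, 2, 1.0):.2f}")  # 0.90 (draft as costly as verify: a loss)
```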

So I did the thing I should have done first: I took our measured α, γ, and speedup on each card and solved the formula backwards for the c that each result implies. At the matched setting - γ=2, acceptance ~91% on both - the cost ratio is the entire story:

| Card | acceptance α | observed speedup | implied cost ratio c |
|---|---|---|---|
| RTX 3090 (CUDA) | 0.906 | 1.41× (+41%) | ≈ 0.47 |
| MI300X, Q4_K_M (ROCm) | 0.914 | 0.91× (-9%) | ≈ 1.0 |
| MI300X, Q8_0 (ROCm) | 0.904 | 0.86× (-14%) | ≈ 1.08 |
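
Solving backwards is just the formula rearranged for c: c = ((1 - α^(γ+1)) / ((1 - α) · S) - 1) / γ. A short sketch that reproduces the c column from the measured α and speedup at γ=2 (the function name is mine):

```python
def implied_cost_ratio(alpha: float, gamma: int, speedup: float) -> float:
    """Invert the Leviathan speedup formula for the cost ratio c."""
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return (tokens_per_round / speedup - 1) / gamma


# (card, measured acceptance, measured speedup), all at gamma = 2
ROWS = [
    ("RTX 3090 (CUDA)", 0.906, 1.41),
    ("MI300X, Q4_K_M (ROCm)", 0.914, 0.91),
    ("MI300X, Q8_0 (ROCm)", 0.904, 0.86),
]

if __name__ == "__main__":
    for card, alpha, speedup in ROWS:
        print(f"{card}: c = {implied_cost_ratio(alpha, 2, speedup):.2f}")
```

Because the speedup is rounded to two decimals before inverting, the last digit of c should not be taken too seriously.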

There it is. At the same acceptance, the 3090’s draft costs under half a verify pass, while the MI300X’s effective cost is about a whole one or more. A draft that is cheap relative to verification (c well under 1) leaves room for the accepted tokens to net out ahead. A draft that costs as much as the verification it is trying to save (c at or above 1) can never come out ahead no matter how good the acceptance - you can read it straight off the formula: a round yields at most γ+1 tokens, so once γc + 1 outgrows the expected tokens per round, (1 - α^(γ+1))/(1 - α), S drops below 1. At γ=2 and ~91% acceptance the break-even c is around 0.86; the 3090 sits comfortably under it, the MI300X over it.
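
The break-even point comes from setting S = 1 and solving for c. A quick sketch (function name mine) that reproduces the ~0.86 figure at γ=2, and also checks that the break-even c stays below 1 for any acceptance short of perfect:

```python
def break_even_cost_ratio(alpha: float, gamma: int) -> float:
    """Cost ratio c at which the expected speedup is exactly 1.0."""
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return (tokens_per_round - 1) / gamma


if __name__ == "__main__":
    print(f"{break_even_cost_ratio(0.906, 2):.3f}")  # 0.863
    # tokens_per_round = 1 + alpha + ... + alpha^gamma < gamma + 1 whenever alpha < 1,
    # so the break-even c is always below 1.
    for alpha in (0.5, 0.9, 0.99):
        for gamma in (1, 2, 4, 8, 16):
            assert break_even_cost_ratio(alpha, gamma) < 1.0
```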

So why is the MI300X’s c so much higher? Bandwidth comes back in here, not as the answer but as the thing that sets c. Verification batches γ+1 tokens into one forward pass, and that pass is only cheap (only about one decode step instead of γ+1 of them) if its dominant cost amortizes across the batch. Streaming the weights is a per-pass cost: load them once, verify the whole batch, done. So when decode is memory-bandwidth-bound, a (γ+1)-token verify costs about what a 1-token decode costs, the denominator’s “1” holds, and c stays small. That is the 3090. But MoE routing and attention are per-token costs - a four-token verify does four tokens’ worth of both, batched or not. On the MI300X, single-stream decode of a 3B-active MoE is not bandwidth-bound at all (it sits under 10% of the card’s 5.3 TB/s, bottlenecked on that per-token overhead instead), so verification does not amortize, the effective “1” in the denominator behaves more like γ+1, and that violation gets folded straight into c. The MI300X is just paying per-token where the 3090 pays per-pass.
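
The per-pass versus per-token split can be put into a toy cost model. To be loud about it: the millisecond figures below are invented to illustrate the two regimes, not measured on either card, and the linear model itself is a simplification.

```python
def verify_pass_cost(n_tokens: int, per_pass_ms: float, per_token_ms: float) -> float:
    """Toy model: a fixed per-pass cost (weight streaming) plus a per-token cost
    (routing, attention, launch overhead)."""
    return per_pass_ms + n_tokens * per_token_ms


def verify_vs_decode(gamma: int, per_pass_ms: float, per_token_ms: float) -> float:
    """Cost of a (gamma+1)-token verify relative to a plain 1-token decode step."""
    verify = verify_pass_cost(gamma + 1, per_pass_ms, per_token_ms)
    decode = verify_pass_cost(1, per_pass_ms, per_token_ms)
    return verify / decode


if __name__ == "__main__":
    gamma = 2
    # Hypothetical bandwidth-bound regime: weight streaming dominates.
    print(f"{verify_vs_decode(gamma, per_pass_ms=5.0, per_token_ms=0.1):.2f}")  # 1.04
    # Hypothetical overhead-bound regime: per-token work dominates.
    print(f"{verify_vs_decode(gamma, per_pass_ms=0.5, per_token_ms=5.0):.2f}")  # 2.82
```

In the first regime the verify costs roughly one decode step, which is the assumption baked into the “1” of γc + 1. In the second it drifts toward γ+1 decode steps, and a back-solved c absorbs the difference.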

And this is why a result that looks like a flat contradiction isn’t one. AMD has published up to 3× from speculative decoding on the very same MI300X, at batch size 1 - the exact configuration where I measured a loss. The difference is the model. Their wins are on dense 34B-to-70B models, and a dense 70B at batch-1 genuinely is bandwidth-bound: streaming seventy billion parameters per token is an enormous per-pass cost that amortizes beautifully, so its c stays small and the formula pays out. qwen3.6 fires only ~3B parameters per token, so it never gets bandwidth-bound on a 5.3 TB/s card, its verify will not amortize, and its c balloons. The model’s active size - not the card - decides whether you are paying per-pass or per-token. (The MoE speculative-decoding literature describes the same mechanism from the other side: under speculation a γ-token verify can activate the union of all experts those tokens route to, inflating exactly the per-token cost that will not amortize - see e.g. MoE-Spec.)
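
The bandwidth-bound-or-not question can be sanity-checked on the back of an envelope. The sketch below rests on rough assumptions that are mine: about 3e9 active parameters per token, about 1 byte per weight for Q8_0, and weight reads as the only traffic (no KV cache, no activations). The dense-70B line is a hypothetical ceiling, not a measurement.

```python
MI300X_PEAK_BYTES_PER_SEC = 5.3e12  # 5.3 TB/s


def bandwidth_fraction(tok_per_sec: float, active_params: float,
                       bytes_per_weight: float, peak_bytes_per_sec: float) -> float:
    """Fraction of peak bandwidth needed to stream the active weights at this decode rate."""
    return tok_per_sec * active_params * bytes_per_weight / peak_bytes_per_sec


def bandwidth_ceiling_tok_per_sec(active_params: float, bytes_per_weight: float,
                                  peak_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode rate if weight streaming were the only cost."""
    return peak_bytes_per_sec / (active_params * bytes_per_weight)


if __name__ == "__main__":
    # Measured MTP-off baseline on the MI300X at Q8_0: 145.4 tok/s.
    frac = bandwidth_fraction(145.4, 3e9, 1.0, MI300X_PEAK_BYTES_PER_SEC)
    print(f"3B-active MoE uses about {frac:.0%} of peak bandwidth")
    # Hypothetical dense 70B at ~1 byte/weight: the ceiling sits far lower,
    # so weight streaming can plausibly dominate each decode step.
    ceiling = bandwidth_ceiling_tok_per_sec(70e9, 1.0, MI300X_PEAK_BYTES_PER_SEC)
    print(f"dense 70B ceiling is about {ceiling:.0f} tok/s")
```

Under those assumptions the MoE lands under a tenth of peak bandwidth, consistent with the figure quoted above.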

One caveat on that c column before I move on: I solved for c rather than measuring it directly, so it is an effective ratio. It lumps a genuinely expensive draft together with verification that simply didn’t amortize, and only a draft-pass/verify-pass timing breakdown, which I haven’t run, would separate the two. (More on the cross-backend caveats at the end.)

None of this is new theory. The Leviathan formula is the founding paper, and the roofline reading of it - amortize per-pass costs, pay per-token costs in full - is well-trodden. What three boxes running the identical model show is the half that’s easy to forget: at fixed acceptance, the speedup is whatever the cost ratio says, and that ratio swings by more than 2× from card to card. Bandwidth was never the variable. It was a proxy for c, and the Strix is proof the proxy leaks.

What to take from it

  • The speedup has a formula, and acceptance is only one of its three terms. Whether MTP helps is set by acceptance α, draft depth γ, and the platform cost ratio c = draft-pass time over verify-pass time. c is the one that varies by hardware, and it is what decides whether the same ~90% acceptance turns into +41% or -14%. If you remember one thing past “tune n_max,” remember that the cost ratio, not the bandwidth, is what you are really up against.
  • Don’t trust acceptance rate as a go/no-go signal. Roughly 90% acceptance bought +41% on one card and a net loss on another, because acceptance only sets the formula’s numerator. Measure end-to-end tokens/sec against the MTP-off baseline; that is the only number that decides it.
  • The n_max=16 default is bad everywhere. It was a 3x regression on the miniPC, 20-26% on the MI300X, and 24% on the 3090. Wherever MTP won, the sweet spot was 2 to 4 and the default was wrong. If you turn spec decoding on, that is the first knob to move.
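
In practice the takeaways above boil down to: sweep n_max, compare every setting against the MTP-off baseline, and only keep MTP if the best setting clears the noise. A minimal sketch using the numbers from the tables in this post; the 2% `min_gain` margin is an arbitrary threshold of mine, not something I measured.

```python
from typing import Dict, Tuple


def summarize_sweep(baseline: float, sweep: Dict[int, float],
                    min_gain: float = 0.02) -> Tuple[int, float, bool]:
    """Return (best n_max, fractional gain vs MTP-off, whether MTP is worth enabling)."""
    best_n_max = max(sweep, key=lambda n: sweep[n])
    gain = sweep[best_n_max] / baseline - 1.0
    return best_n_max, gain, gain >= min_gain


# Decode tok/s keyed by n_max, copied from the tables in this post.
RTX_3090 = {1: 217, 2: 236, 3: 233, 4: 234, 8: 170, 16: 128}
MI300X_Q8 = {1: 124.8, 2: 125.3, 3: 139.8, 4: 133.2, 8: 124.1, 16: 115.7}

if __name__ == "__main__":
    for name, baseline, sweep in (("3090", 167, RTX_3090), ("MI300X Q8_0", 145.4, MI300X_Q8)):
        n_max, gain, enable = summarize_sweep(baseline, sweep)
        print(f"{name}: best n_max={n_max}, {gain:+.0%} vs off, enable MTP: {enable}")
```

Note that acceptance rate never appears in it. Tokens per second against the baseline is the only input.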

The card that benefitted most was the mid-range one, because its draft was cheap relative to its verify - which bandwidth only loosely predicts. It was never about small versus big. It’s the cost ratio, and that’s a thing you can measure instead of guess.

So: if you read the last post and skipped MTP on your 3090 - sorry, go turn it on.

Caveats

  • Single-stream, batch-1. This is the latency case. Under heavy concurrent batching the decode regime changes (it becomes compute-bound for a different reason), and the spec-decode tradeoff would need its own measurement. These conclusions are specific to the one-request-at-a-time case.
  • The workload was CoT-heavy, which flatters MTP. All three machines were measured on a thinking-model prompt that generates long, structured chains of reasoning, and that kind of text is unusually self-predictable, which pushes acceptance toward 90% and the speedup with it. The relative comparison across machines holds, because they all saw the same favorable workload, but the absolute gains would shrink on less predictable output. Read +41% as the top of the range for this card, not a universal number.
  • The three rows are not a controlled cross-hardware A/B, and c lives in the software stack as much as the silicon. Strix ran Vulkan, the MI300X ROCm, the 3090 CUDA, at differing quants, so the cross-card magnitudes carry the usual cross-backend caveat; what holds up is the sign and rough size of each result plus the within-machine off-vs-on comparisons, which were measured identically per card. The Strix is the live example of the backend mattering: an open llama.cpp issue (#23126) reports its Vulkan path on unified-memory iGPUs pays a draft/target synchronization penalty that inflates c for backend reasons, not hardware ones (it’s filed for a separate draft model, not MTP, so it may not transfer). It’s also why I don’t quote a back-computed c for the Strix at all: I never captured a clean acceptance number for it, so any c would be invented.
  • One model, one architecture. qwen3.6 is a 3B-active MoE, which is part of why single-stream decode is so overhead-bound on the MI300X. A dense model would shift the compute-versus-bandwidth balance, and where it landed on each card is its own measurement.