MTP Speculative Decoding on Strix Halo: How I Made It 3x Slower Before I Made It Faster

In the last post I landed on qwen3.6 as the most usable coding model I could actually run on the miniPC (a Strix Halo box: AMD Ryzen AI MAX+ 395, 128 GiB of unified memory, Vulkan). This is a MUCH shorter follow-up about squeezing more tokens/sec out of it with MTP, and about the ditch I drove into on the way.

The three-sentence version: turned on with its default settings, MTP made generation 3x slower! Tuned for this model, it’s about 18-26% faster. The difference between the two is a single number.

Contents

What the heck is MTP??
The wrong path
The knob that matters: --spec-draft-n-max
The config I'd actually use
What about the faster IQ4_XS build?
This is a Strix Halo / Vulkan result
The takeaway

What the heck is MTP??

Normally a model generates one token per forward pass: run the whole network, get one token, repeat. That’s slow. Speculative decoding speeds it up by first having a small, fast predictor - something far cheaper to run than the full model - guess the next several tokens, and then letting the full model verify that whole batch of guesses in a single forward pass. The trick is that checking several tokens at once costs the big model about the same as generating one. Every guess it accepts is a token you got essentially for free. The catch, of course, is that every rejected guess is wasted compute, both the draft’s and the verify’s. The math only pays off if the guesses are accepted often.
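To make the accept/reject step concrete, here is a minimal sketch of greedy verification. It is an illustration under stated assumptions, not llama.cpp’s actual code: `verify_draft` is a hypothetical helper, tokens are plain integers, and sampling is assumed to be greedy so “accept” just means “the full model would have picked the same token”.

```python
def verify_draft(draft: list[int], target: list[int]) -> list[int]:
    """Return the tokens actually emitted after one verify pass.

    draft  - tokens guessed by the cheap predictor
    target - what the full model picks at each position when it scores the
             whole draft in one pass; it has len(draft) + 1 entries because
             the pass also yields the token that follows the last guess.
    Assumes greedy sampling (hypothetical simplification).
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")

    emitted = []
    for guess, wanted in zip(draft, target):
        if guess != wanted:
            break  # first wrong guess: everything after it is discarded
        emitted.append(guess)

    # The full model's own token at the first mismatch (or after a fully
    # accepted draft) is always valid, so every pass emits at least one token.
    emitted.append(target[len(emitted)])
    return emitted


if __name__ == "__main__":
    print(verify_draft([5, 7, 9], [5, 7, 2, 4]))  # two guesses kept, third replaced
    print(verify_draft([5, 7], [5, 7, 8]))        # all guesses kept, plus one more
    print(verify_draft([1, 2, 3], [9, 0, 0, 0]))  # nothing kept, one token emitted
```

Note the worst case: a fully rejected draft still emits one token, but you paid for the drafting and the wider verify pass to get it.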

MTP (Multi-Token Prediction) is the self-speculative version: instead of running a second small “draft” model alongside the big one, the model ships with an extra lightweight head trained to predict a few tokens ahead. The draft and the verify come from the same model. llama.cpp added support in PR #22673 (merged 2026-05-16, build b9180 or later), exposed as --spec-type draft-mtp.

Qwen3.6 here is a 35B-A3B Mixture-of-Experts model: 35B total parameters, but only ~3B are active per token. That “A3B” part turns out to matter a lot for whether MTP helps.

The wrong path

The first thing I did, which was the first thing and not the smart thing, was the naive run: flip MTP on, leave everything at defaults, and see what stock settings buy you. It’s what most people will reach for, and the articles all quote ~2x, so why not? Generation dropped to 18.6 tok/s, down from ~60 with MTP off. So, the opposite of 2x. Definitely didn’t seem right lol.

The culprit was --spec-draft-n-max, the number of tokens the head is allowed to draft ahead before the model checks its work. It defaults to 16. Here’s what it gets you at a range of values (tok/s, higher is better):

MTP config (qwen3.6-35B-A3B, Unsloth Q4_K_M - early run)	tok/s
off (baseline)	59.7
on, `--spec-draft-n-max 16` (the default)	18.6
on, `--spec-draft-n-max 8`	27.6
on, `--spec-draft-n-max 4`	66.9

A caveat on these numbers: they were quick and exploratory. I killed off any stray model servers but skipped the full services-down, fans-pinned protocol I used for the recommended config below, so read the absolute baseline loosely. (The baseline also sits at ~60 here rather than the ~65 you’ll see later because this early run used a slightly slower quant upload. Why two “Q4_K_M” files run at different speeds is its own rabbit hole, maybe a future companion post.) A 3x regression dwarfs either effect, which is the whole point.

So there we have it: for me, the default was a roughly 3x regression, and the line between useless and useful is narrow. n=8 is still slower than no MTP at all, and only by n=4 does it pull ahead. I almost wrote the whole thing up as “MTP doesn’t work on Strix Halo MoE” right there. Then I did the thing I should’ve led with if I’d actually wanted good numbers: went and read how MTP works.
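For the record, here is the arithmetic behind “roughly 3x”, using only the four tok/s figures from the table above. The function name `speed_ratios` is just a label for this sketch.

```python
def speed_ratios(baseline: float, results: dict[int, float]) -> dict[int, float]:
    """Throughput of each n_max setting relative to MTP off (1.0 = no change)."""
    return {n_max: tok_s / baseline for n_max, tok_s in results.items()}


BASELINE_TOK_S = 59.7                     # MTP off, early run
EARLY_RUN = {16: 18.6, 8: 27.6, 4: 66.9}  # --spec-draft-n-max -> tok/s

if __name__ == "__main__":
    for n_max, ratio in speed_ratios(BASELINE_TOK_S, EARLY_RUN).items():
        verdict = "faster" if ratio > 1.0 else "slower"
        print(f"n_max={n_max:>2}: {ratio:.2f}x of baseline ({verdict})")
```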

The knob that matters: --spec-draft-n-max

The default of 16 is calibrated for dense, instruction-tuned models, which accept long draft runs. An A3B MoE is the opposite: only ~3B parameters fire per token, the head’s predictions get rejected sooner, and every rejected draft past the acceptance point is pure waste. The community guidance for A3B-class MoEs converges on n=2 or 3. Even llama.cpp’s own MTP pull request reports its best results around 3 draft tokens at roughly 75% steady-state acceptance, nowhere near the default 16, and acceptance only falls faster as you push past that.
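A toy cost model shows why the sweet spot sits at a small n and why 16 can land below baseline. Everything here is an assumption for illustration: each drafted token is accepted independently with the same probability, the head always drafts the full n tokens, and `draft_cost` and `verify_extra_cost` are hypothetical per-token costs (in units of one normal forward pass) that I picked by hand, not measured.

```python
def relative_speed(accept_prob: float, n_draft: int,
                   draft_cost: float, verify_extra_cost: float) -> float:
    """Throughput relative to plain decoding (1.0 = same speed as MTP off).

    accept_prob       - chance each drafted token is accepted (assumed constant)
    draft_cost        - hypothetical cost of drafting one token
    verify_extra_cost - hypothetical extra verify cost per drafted token
    """
    # Draft token k only counts if tokens 1..k were all accepted; the verify
    # pass itself always contributes one token.
    tokens = 1.0 + sum(accept_prob ** k for k in range(1, n_draft + 1))
    cost = 1.0 + n_draft * (draft_cost + verify_extra_cost)
    return tokens / cost


def best_n(accept_prob: float, draft_cost: float, verify_extra_cost: float,
           candidates=range(0, 17)) -> int:
    """Draft length with the highest modelled throughput (0 means MTP off)."""
    return max(candidates,
               key=lambda n: relative_speed(accept_prob, n, draft_cost, verify_extra_cost))


if __name__ == "__main__":
    # 0.75 is the acceptance figure quoted from the PR; the costs are made up.
    for n in (1, 2, 3, 4, 8, 16):
        print(f"n={n:>2}: {relative_speed(0.75, n, 0.1, 0.2):.2f}x")
    print("best n:", best_n(0.75, 0.1, 0.2))
```

The absolute numbers mean nothing, since the costs are invented. The shape is the point: tokens gained per extra draft shrink geometrically while the cost keeps growing linearly, so the curve peaks early and then sinks below 1.0.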

The other thing worth saying plainly: the reported 2x speedup you may have seen for “Qwen3.6 + MTP” comes from other setups (the PR author quotes >2x, testing on a different stack than a power-capped Strix Halo). On this box, for the 35B-A3B MoE on Vulkan, the realistic ceiling I measured is roughly 1.2x (1.18x to 1.26x depending on the quant). Lower acceptance means less free speed. That’s not a failure, it’s just the honest number for this class of model on this hardware, and it’s still worth having.

There’s also a second knob, --spec-draft-p-min (the minimum probability the head needs before a draft is even attempted, default 0.75). Some guides call it the most impactful parameter. On my hardware, sweeping it from 0.5 to 0.9 stayed within sampling noise, so I left it at the default. Your mileage may vary; it’s worth a quick sweep, but n_max is the one that actually moved my numbers.

The config I'd actually use

Two setups: pick by whether you care more about disk or about the last drop of quality. Both are gfx1151 on Vulkan (see the caveat at the end if you’re on something else), same clean-bench methodology as the last post.

The build. You need llama.cpp with PR #22673, so b9180 or later. Check with llama-server --help and look for draft-mtp in the --spec-type modes. Build it Vulkan-only: there’s a known foot-gun (issue #23199) where, in a dual Vulkan+ROCm build, the MTP tensors get placed on the ROCm device and MTP is silently disabled even when you asked for Vulkan. I didn’t run into this thankfully (my build is Vulkan-only), but watch out.

Q4_K_M (smaller, faster, what I’d default to). Two files from bartowski/Qwen_Qwen3.6-35B-A3B-GGUF: the model (Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf, ~20 GiB) and the MTP head as a separate file (mtp-Qwen_Qwen3.6-35B-A3B-Q4_0.gguf, ~1 GiB).

llama-server \
  --model Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --model-draft mtp-Qwen_Qwen3.6-35B-A3B-Q4_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 -c 8192 -fa on --parallel 1 \
  -t 32 -tb 32 -ub 2048 \
  -ctk q8_0 -ctv q8_0 \
  --reasoning off \
  --host 0.0.0.0 --port 8089 --alias qwen3.6

That gets ~77 tok/s, versus ~65 with the --model-draft and --spec-type lines removed. About +18%.

Q8_0 (more disk, a hair more quality). Same flags, the Q8_0 trunk and head (Qwen_Qwen3.6-35B-A3B-Q8_0.gguf ~35 GiB, mtp-Qwen_Qwen3.6-35B-A3B-Q8_0.gguf ~2 GiB), and one important change: --spec-draft-n-max 2 instead of 3. Expected: ~63 tok/s, versus ~50 with MTP off. About +26%.

The sweet spot drops to 2 at Q8 because the trunk’s own predictions are more confident, so it accepts fewer drafts from the head, and drafting a third token just wastes work. Sweep n_max (1, 2, 3, 4) on whatever model and quant combo you land on and use whichever wins.
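If you want to automate that sweep, something like the sketch below would do it. Treat it as a starting point under assumptions: it relies on llama-server exposing a `/health` endpoint and a `/completion` endpoint whose JSON reply carries `timings.predicted_per_second`, which is what I believe current builds do, so check against yours. The prompt, the function names, and binding to 127.0.0.1 instead of 0.0.0.0 are my own choices for the sketch.

```python
import json
import subprocess
import time
import urllib.request

# Same flags as the Q4_K_M config above, minus --spec-draft-n-max.
BASE_FLAGS = [
    "--model", "Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf",
    "--model-draft", "mtp-Qwen_Qwen3.6-35B-A3B-Q4_0.gguf",
    "--spec-type", "draft-mtp",
    "-ngl", "999", "-c", "8192", "-fa", "on", "--parallel", "1",
    "-t", "32", "-tb", "32", "-ub", "2048",
    "-ctk", "q8_0", "-ctv", "q8_0",
    "--reasoning", "off",
    "--host", "127.0.0.1", "--port", "8089",
]
PROMPT = "Write a Python function that merges two sorted lists."  # arbitrary


def build_command(n_max: int, binary: str = "llama-server") -> list[str]:
    """Full server command line for one n_max value."""
    return [binary, *BASE_FLAGS, "--spec-draft-n-max", str(n_max)]


def wait_until_ready(url: str, timeout_s: float = 300.0) -> None:
    """Poll the health endpoint until the model has finished loading."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return
        except OSError:  # connection refused or non-200 while still loading
            pass
        time.sleep(1)
    raise TimeoutError(f"server at {url} never became ready")


def measure_tok_per_s(base_url: str, n_predict: int = 256) -> float:
    """Run one completion and return the server-reported generation speed."""
    body = json.dumps({"prompt": PROMPT, "n_predict": n_predict}).encode()
    req = urllib.request.Request(f"{base_url}/completion", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=600) as resp:
        reply = json.load(resp)
    return float(reply["timings"]["predicted_per_second"])  # assumed field name


def sweep(values=(1, 2, 3, 4), base_url: str = "http://127.0.0.1:8089") -> dict[int, float]:
    """Restart the server once per n_max value and record tok/s for each."""
    results = {}
    for n_max in values:
        server = subprocess.Popen(build_command(n_max),
                                  stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        try:
            wait_until_ready(f"{base_url}/health")
            measure_tok_per_s(base_url)            # warm-up run, discarded
            results[n_max] = measure_tok_per_s(base_url)
        finally:
            server.terminate()
            server.wait()
    return results


def best_setting(results: dict[int, float]) -> int:
    """The n_max value with the highest measured tok/s."""
    return max(results, key=results.get)
```

Calling `sweep()` and passing the result to `best_setting()` gives you the winner for that model and quant. One run per setting is noisy, so repeat it a few times before trusting a 1-2 tok/s difference.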

The head quant matters less than you’d think, and lighter is better. A Q4_0 head on the Q4_K_M trunk hit 77 tok/s; a Q8_0 head on the same trunk did 74. The bigger head’s better predictions don’t pay for the extra bandwidth they cost.

Configs that lose, for completeness: the default --spec-draft-n-max 16 (~19 tok/s, a 3x regression) and --spec-draft-n-max 8 (~28 tok/s, still slower than no MTP). The default is the trap.

If you’d rather not juggle two files, unsloth/Qwen3.6-35B-A3B-MTP-GGUF bundles the trunk and head into one (drop --model-draft, keep --spec-type draft-mtp). I measured ~75 tok/s at n=3, about 2 tok/s behind the bartowski split, with identical quality on Polyglot, gsm8k, and ifeval. Use whichever workflow you prefer.

What about the faster IQ4_XS build?

There’s an IQ4_XS build with the MTP head baked in, and a pretty wild-sounding claim for IQ4_XS+MTP on Strix Halo was making the rounds: 90.8 tok/s average, 110.6 peak. IQ4_XS is a smaller quant than Q4_K_M (about 4.25 bits per weight versus ~4.8), and since generation is bandwidth-bound, smaller CAN mean faster, so it’s a plausible claim. I downloaded it and benched it the same way as everything above, but I was not able to reproduce it.

IQ4_XS + MTP (n=2), my box	tok/s
q8_0 KV cache	~78
f16 KV cache	~81
the number I was chasing	90.8 avg / 110.6 peak

Apples-to-apples, at the q8_0 KV cache my recommended config uses, IQ4_XS lands around 78 tok/s, a hair over the Q4_K_M + MTP setup above (~77) but inside the noise. Switching IQ4_XS to an f16 KV cache pushes it to ~81. That is a real few percent, but it is the cache talking, not the quant: f16 would lift the Q4_K_M numbers the same way. So IQ4_XS earns you a little (a smaller file), and a little more if you spend the extra memory on an f16 cache, but it is not the different league the 90+ figure implies. A raw llama-bench pass on the file came in at 73, so that is not where the headline comes from either.
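A quick sanity check on the “smaller quant, faster generation” idea, as a back-of-the-envelope sketch. It assumes tok/s scales inversely with bits per weight and nothing else, which is the most generous case for IQ4_XS, and it uses the approximate bpw figures quoted above.

```python
def bandwidth_bound_speedup(bpw_from: float, bpw_to: float) -> float:
    """Ideal speedup from switching quants if speed tracked bytes read exactly."""
    if bpw_from <= 0 or bpw_to <= 0:
        raise ValueError("bits per weight must be positive")
    return bpw_from / bpw_to


Q4_K_M_BPW = 4.8         # approximate
IQ4_XS_BPW = 4.25        # approximate
Q4_K_M_MTP_TOK_S = 77.0  # my recommended config
CLAIMED_AVG_TOK_S = 90.8

ideal = bandwidth_bound_speedup(Q4_K_M_BPW, IQ4_XS_BPW)
projected = Q4_K_M_MTP_TOK_S * ideal

if __name__ == "__main__":
    print(f"ideal speedup from bpw alone: {ideal:.2f}x")
    print(f"projected IQ4_XS + MTP: {projected:.0f} tok/s (claim: {CLAIMED_AVG_TOK_S})")
```

Even under that best-case assumption the projection stays below the 90.8 average, so the quant size alone does not explain the headline number.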

My best guess for the gap to 90.8: I cap this box at 100W with ryzenadj for round-the-clock thermal stability (the “I melted it twice” saga from the last post). Run the chip hotter, or on a newer build, and you would probably claw some of it back. Worth a shot if you have the thermal headroom. But on a power-limited Strix Halo it is a modest step over the Q4_K_M + MTP config, not a leap, and nothing I measured got close to 90. There are some thermal upgrades I may attempt to make, and maybe I’ll revisit this with a higher cap if I do.

This is a Strix Halo / Vulkan result

The numbers above are gfx1151 on Vulkan. On CUDA, an early community writeup for this same model found no net speedup from llama.cpp’s speculative-decoding paths: it tested 19 configurations on an RTX 3090 and found none faster than baseline (the same author’s HackMD notes lay out the detail). Note that this is a llama.cpp-specific result: the same author found vLLM’s MTP faster on the same card. So the speedup here may be specific to the Vulkan path or to the unified-memory architecture. If you’re on a 3090 or an A100, don’t expect these numbers (Damen's Update: This is TERRIBLE ADVICE!!!), and if you measure your own, publish them: there’s a bit of public A3B-MoE spec-decode data now (mostly CUDA, plus a Strix Halo ROCm run or two), but I couldn’t find a single A3B-on-gfx1151-via-Vulkan benchmark out there, so that corner is wide open.

The takeaway

MTP on this hardware is a real but modest gain: ~18-26%, not the 2x you’ll see quoted for the dense model. And the default config is actively harmful on an A3B MoE. If you take one thing from this: turn --spec-draft-n-max down to 2 or 3 before you decide whether MTP works for you.