Do the wins stack? MTP on top of sliding-window attention
TL;DR: Yes! With a few very large asterisks, you CAN stack MTP and SWA. You need a patch for llama.cpp (below), and the payoff depends entirely on what you’re generating: on predictable output the stack is spectacular (+46% decode at 123K), but on open-ended generation MTP still loses money even with SWA under it.
Where the last post left off
In the 256K post I got usable long context out of my Strix Halo box by marking the hybrid’s ten full-attention layers sliding-window, except for a few kept full as a far-recall backstop. Bounded KV, flat prefill, decode that holds up at depth. That post deliberately left one thread on the table: whether the other decode lever still works on top of it. I mean one win is cool but two is better, right?
That other lever is MTP - multi-token prediction, llama.cpp’s self-speculative decode - which I previously got working on this box, then gave bad advice about, then apologized for. The short version of those two posts: MTP drafts a few tokens ahead with a lightweight head and verifies them in one pass, the payoff is governed by the Leviathan et al. formula, and whether it helps depends on the acceptance rate α, the draft depth γ, and the platform cost ratio c. On this box, with the right --spec-draft-n-max, it’s a real win.
So: SWA bounds the KV and keeps decode fast at depth. MTP amortizes the weight reads. They attack different costs so they should stack, you’d think. You can see where this is going.
(A confession: if you applied the full fork patch from the 256K post, or read its diff unusually closely, you may have noticed a small MTP fix sitting in there that the post never mentioned. That was this. It shipped quietly because it belonged to a post I hadn’t written yet - this one - and because I was too lazy to rip it out just for that post lol)
The crash
They did not stack. They didn’t even underperform together, which would have at least been interesting data. Turning both on aborted at load:
GGML_ASSERT(hparams.swa_type == LLAMA_SWA_TYPE_NONE
&& "Use llama_kv_cache_iswa for SWA") failed
src/llama-graph.cpp:2704, via llama_model_qwen35moe::build_arch_graph
Well that is annoying. Here I finally remembered to do the smart thing: google before hand-patching. The search turned up issue #23322 (people combining SWA and MTP on the Qwen3.6 family, hitting a different problem) and a pair of upstream iSWA crash fixes (#24294, #23131) that my fork’s base predated. Which raised an awkward possibility: maybe upstream had already fixed my crash while I wasn’t looking.
Llama.cpp moves FAST
My fork was 638 commits behind master… not ideal. There was no way to find out whether the crash was already fixed except to pay the debt, so I rebased the whole patch set onto current upstream - through an API migration that had renamed or restructured basically everything my patches touched. Two of my patches turned out to be obsolete and got dropped outright: the original env-gated sparse-mask hack (superseded by the SWA recipe it grew into) and my old MTP graph routing (upstream had refactored MTP properly in #23643).
The answer, after all that: no. Current upstream, same crash. The two upstream iSWA fixes were real but covered different holes. This one was still my problem, boo.
The actual fix
The bug is a divergence between two graphs that are supposed to agree. When the SWA recipe is on, the main graph correctly routes attention through the iSWA hybrid cache. But MTP builds its own little sub-graph for the draft head, and that sub-graph built its attention input with the non-iSWA builder - the one that opens with the assert above. The main graph and the MTP graph disagreed about what kind of cache the model was running on, and the assert is what disagreement looks like.
The fix mirrors the main graph: when swa_type != NONE, build the MTP sub-graph’s attention input through the iSWA builder too. The MTP layer’s index isn’t marked as a sliding-window layer, so it lands on the full-attention sub-cache - the draft head attends to the full KV, which is what you want from the thing whose guesses get verified. The fix is tiny and the non-SWA path is byte-identical to before.
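As a toy illustration of that divergence (this is not llama.cpp code: every name below is hypothetical, and the layer indices are made up for the example), the bug amounts to two graph builders choosing their attention-input path by different rules:

```python
from enum import Enum


class SwaType(Enum):
    NONE = 0
    STANDARD = 1


def main_graph_attn_input(swa_type: SwaType) -> str:
    """Main graph: takes the iSWA path whenever a sliding window is configured."""
    return "plain" if swa_type is SwaType.NONE else "iswa"


def mtp_attn_input_before(swa_type: SwaType) -> str:
    """Pre-fix MTP sub-graph: always the plain builder, which asserts under SWA."""
    if swa_type is not SwaType.NONE:
        raise AssertionError("Use the iSWA cache for SWA")
    return "plain"


def mtp_attn_input_after(swa_type: SwaType) -> str:
    """Post-fix MTP sub-graph: mirror the main graph's rule."""
    return main_graph_attn_input(swa_type)


def sub_cache_for(layer: int, windowed_layers: set[int]) -> str:
    """Inside the iSWA cache, only layers marked as windowed use the SWA sub-cache."""
    return "swa" if layer in windowed_layers else "full"


# Illustrative indices only: a few windowed layers, and an MTP layer not among them.
WINDOWED = {3, 7, 11}
MTP_LAYER = 40

# After the fix, both graphs agree for every SWA setting.
for swa_type in SwaType:
    assert mtp_attn_input_after(swa_type) == main_graph_attn_input(swa_type)

print(sub_cache_for(MTP_LAYER, WINDOWED))  # the unmarked MTP layer lands on "full"
```

The point of routing both builders through one rule is that the non-SWA case falls out unchanged, which is the toy version of "byte-identical to before".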
Best case: super-additive at 123K
With both levers finally running in the same process, the 2x2 at ~123K context (predictable log-continuation generation, greedy, q4_1 KV; full config in the caveats):
| @123K, predictable gen | prefill t/s | decode t/s | vs dense | acceptance |
|---|---|---|---|---|
| dense | 388.0 | 37.09 | - | - |
| SWA only | 550.3 | 42.15 | +14% | - |
| dense + MTP | 361.4 | 41.04 | +11% | 0.805 |
| SWA + MTP | 469.1 | 54.14 | +46% | 0.961 |
Separately the levers buy +14% and +11%. Together they buy +46%, well past their sum, and the acceptance column says why. Under SWA, the draft head’s guesses went from being accepted 80% of the time to 96% of the time. SWA isn’t just contributing its own speedup next to MTP’s - it’s making MTP better at its job.
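The “well past their sum” claim is easy to check from the decode column alone. A quick sketch using only the four decode numbers in the table (the variable names are just for illustration):

```python
# Decode throughput (tokens/s) at ~123K, copied from the table above.
decode_tps = {
    "dense": 37.09,
    "swa": 42.15,
    "dense_mtp": 41.04,
    "swa_mtp": 54.14,
}


def gain(tps: float, baseline: float) -> float:
    """Relative speedup over a baseline, e.g. 0.14 for +14%."""
    return tps / baseline - 1.0


base = decode_tps["dense"]
swa_gain = gain(decode_tps["swa"], base)
mtp_gain = gain(decode_tps["dense_mtp"], base)
stacked_gain = gain(decode_tps["swa_mtp"], base)

# Two "no interaction" expectations for the stack.
additive = swa_gain + mtp_gain
multiplicative = (1.0 + swa_gain) * (1.0 + mtp_gain) - 1.0

# MTP's marginal gain when SWA is already on.
mtp_on_swa = gain(decode_tps["swa_mtp"], decode_tps["swa"])

print(f"SWA alone:        {swa_gain:+.1%}")
print(f"MTP alone:        {mtp_gain:+.1%}")
print(f"additive guess:   {additive:+.1%}")
print(f"multiplicative:   {multiplicative:+.1%}")
print(f"measured stack:   {stacked_gain:+.1%}")
print(f"MTP on top of SWA: {mtp_on_swa:+.1%}")
```

On these numbers the stack also beats the multiplicative expectation (about +26%), and MTP’s marginal gain on top of SWA comes out around +28%, versus +11% on dense.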
(Prefill note: MTP costs you some prefill wherever it runs, 7-15% in this 2x2 - the draft head is extra graph. SWA-only is the prefill champion. This post is about decode.)
Why SWA lifts acceptance
The mistakes post was all about the denominator of the Leviathan formula - the cost ratio c that the hardware sets, the thing I’d ignored while staring at acceptance. This result is the numerator’s revenge: same box, same c, and the speedup moved anyway, because SWA moved α.
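For reference, the Leviathan et al. expression for expected wall-time improvement, with acceptance rate α, draft depth γ, and cost ratio c, is usually written as follows (it assumes acceptances are independent across draft positions, which is that paper’s simplification rather than something measured here):

$$
\text{speedup} = \frac{1 - \alpha^{\gamma + 1}}{(1 - \alpha)\,(\gamma c + 1)}
$$

The hardware-set c shows up only in the (γc + 1) factor of the denominator. Everything else is a function of α, which is why a change in acceptance alone can move the result while c stays fixed.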
The mechanism is almost embarrassingly simple once you see it. The draft head is small and shallow: its whole job is guessing what the big model will say next. Bound the big model’s effective context with a sliding window and you’ve lowered its next-token entropy - it has less history to condition on, so it becomes more predictable. A more predictable target is an easier target to draft for.
The depth behavior fits: at 20K context, dense acceptance is already ~0.88 (short contexts are easy to draft for), and MTP+SWA (70.3 t/s) tracks MTP-only (73.8 t/s) to within about 5% - no lift, and no big penalty either. The lift only appears where the dense model’s conditioning gets long and the window’s pruning of it starts to matter.
It also matches something the long-context speculative-decoding literature already knows from the other direction. MagicDec (Together AI’s writeup) attacks long-context decode by giving the draft a StreamingLLM-style window while the target stays full-attention, and reports that a windowed draft holds high acceptance out to 100K: windowed and full-context distributions agree on most tokens. Mine is the mirror image. Here the target is the windowed one, because bounding its KV is the whole point of the SWA recipe, and the acceptance gain lands on the stock MTP head as a side effect.
The lift showed up everywhere I measured, in the same direction. Not in the same size though, and that’s where the good news stops.
Forcing the worst case
That 0.96 acceptance came from continuing a repetitive maintenance log - about the most predictable generation task that exists. Of course the draft head aced it! This series has burned me enough times that an exciting best-case number now triggers a panic reflex instead of a victory lap: go find the workload that hates it.
The test: a diverse ~79K context (combinatorial prose, nothing repeating) and a genuinely open-ended analytical generation task, with generation diversity measured on both sides so I could confirm the two configs were doing comparably hard work rather than one of them quietly degenerating into an easy loop. Same 2x2:
| @79K, open-ended gen | decode t/s | acceptance | MTP’s delta vs its own baseline |
|---|---|---|---|
| dense | 41.13 | - | - |
| dense + MTP | 36.77 | 0.510 | -11%, net-negative |
| SWA only | 44.66 | - | - |
| SWA + MTP | 43.51 | 0.549 | -2.6%, still negative |
On hard generation, MTP does not pay off. Not dense, and not with SWA either. The acceptance lift is still there (0.51 to 0.55, same direction as always) but it’s small, and 0.55 is nowhere near the ~0.8 acceptance this box needs before the draft-and-verify overhead nets out ahead. SWA shrank MTP’s loss from -11% to -2.6%. Shrinking a loss is nice, but it isn’t a win.
SWA lifts MTP acceptance everywhere - the mechanism is real - but the size of the lift tracks how predictable the generation already was: 0.80 to 0.96 on the log, 0.51 to 0.55 on the analysis. SWA lowers the predictability threshold at which MTP breaks even. It does not make MTP universal, because nothing does - the acceptance gate never goes away, and your workload is what sets it.
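To get a feel for where that acceptance gate sits, here is a minimal sketch that evaluates the Leviathan et al. expression and bisects for the break-even acceptance. The γ and c values are placeholders, not measurements from this box, and the formula’s independence assumption means the output is a rough guide rather than a prediction:

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Leviathan et al. expected wall-time improvement.

    Uses the geometric-series form 1 + alpha + ... + alpha**gamma, which equals
    (1 - alpha**(gamma + 1)) / (1 - alpha) but stays defined at alpha == 1.
    """
    expected_tokens = sum(alpha ** i for i in range(gamma + 1))
    return expected_tokens / (gamma * c + 1.0)


def break_even_alpha(gamma: int, c: float, tol: float = 1e-9):
    """Smallest alpha in [0, 1] with speedup >= 1, or None if drafting never pays."""
    if expected_speedup(1.0, gamma, c) < 1.0:
        return None  # even perfect acceptance cannot cover the draft cost
    lo, hi = 0.0, 1.0
    # Speedup is monotonically increasing in alpha, so bisection is safe.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_speedup(mid, gamma, c) >= 1.0:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Placeholder (gamma, c) pairs for illustration only.
    for gamma, c in [(1, 0.3), (2, 0.5), (3, 0.5)]:
        print(gamma, c, break_even_alpha(gamma, c))
```

With γ = 1 the break-even works out to α = c exactly, which makes a handy sanity check on the bisection.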
Meanwhile, look at the SWA-only rows in both tables. +14% at 123K, +8.6% at 79K, no acceptance column, no conditions. That’s the lever that doesn’t care what you’re generating.
The rule
- Take the SWA recipe’s decode win always. It’s bounded-KV physics, generation-agnostic, and it’s the reason the 256K post exists.
- Add MTP on top only when the generation is predictable or structured: code, formatted continuation, extraction, log-like output. There it’s spectacular: +46% at 123K.
- Leave MTP off for open-ended, creative, or analytical generation, even with SWA. It was net-negative dense and it’s still net-negative stacked; SWA just made it hurt less.
- If you only remember one thing: acceptance is workload, cost ratio is hardware, and you need both on your side.
Getting the patch
As confessed at the top: if you’re running the full fork patch from the 256K post, you’ve had this fix all along. It’s also available standalone as mtp-iswa.patch; it applies cleanly to current upstream.
I’m not sending this one upstream, at least for now: stock Qwen3.6 GGUFs don’t set swa_type on this arch, so a vanilla model never hits this assert. You need a downstream SWA config, like mine, to reach the path. That makes it a latent-inconsistency fix - the MTP sub-graph diverges from the main graph in exactly the handling the assert guards - rather than a crash any stock user is living with today, and “apply my SWA patch first” is a weak reproduction story for a maintainer to review against.
If you’re stacking SWA on this family some other way and hit the assert, the patch is right there. And if anyone feels like carrying it through the upstream contribution process, PLEASE be my guest - the diff is small, the reasoning is in this post, and I’d be happy to see it land. (Related upstream: #23322 is the same feature combination but a different problem - a runtime acceptance issue, not this build crash.)
Caveats
- One model, one architecture. Everything here is Qwen3.6-35B-A3B, the qwen35moe hybrid (10 full-attention GQA + 30 Gated-DeltaNet layers), UD-Q4_K_M with the bundled MTP head. The crash and fix are specific to that arch’s MTP graph. The acceptance-lift mechanism (a bounded context lowers the target’s entropy, which makes drafting easier) has no model-specific step in the argument, but I’ve measured it exactly once, on one family. Treat the direction as an argument and the magnitudes as this box, this model.
- Config, for reproducibility: Vulkan/RADV, --dynsparse-swa 16384 --dynsparse-swa-full 27,31,35,39, q4_1 KV, -ub 512, greedy decode, --spec-type draft-mtp. The SWA flags are from my fork (patches shipped with the 256K post); everything else is stock.
- The worst-case comparison is the diversity-matched one. I actually ran two hard tasks at 79K, and in the other one the two configs went different directions off the same prompt: the SWA run kept continuing the document it was given (word-diversity 0.29, i.e. it settled into something predictable) while the dense run veered off into meta-reasoning about the task (0.68). Different generations means their acceptance numbers aren’t comparable, so that run doesn’t get to be the headline - though for what it’s worth, it showed the same shape: the moment the generation turned predictable, MTP+SWA won big (+32%). The -2.6% comes from the task where both configs measured equally diverse (~0.72). Raw numbers for everything in this post: the result JSON.
- Best case and worst case are at different context depths (123K vs 79K), because they’re built from different source text. The dense baselines bracket them consistently, but don’t read the +46%-vs-(-2.6%) pair as a single controlled variable flip; it’s two points on the same curve.
- Single-stream, batch-1, as always in this series. Greedy decode, so MTP output is byte-identical to MTP-off and the comparison is a pure speed test.
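The word-diversity figures above could be computed several ways. A minimal sketch, assuming a plain distinct-word ratio (unique words over total words), which may not be the exact metric behind the numbers in this post:

```python
import re


def word_diversity(text: str) -> float:
    """Distinct-word ratio: unique lowercase words divided by total words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# A looping, log-like sample versus a sample with almost no repetition.
looping = "pump ok pump ok pump ok pump ok"
varied = "the window trims history while the draft head guesses ahead"
print(word_diversity(looping), word_diversity(varied))
```

A ratio like this falls as samples get longer, so it only compares fairly across generations of similar length.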