Damen Knight

MTP Speculative Decoding on Strix Halo: How I Made It 3x Slower Before I Made It Faster

Sat, 30 May 2026 12:00:00 GMT

In the last post I landed on qwen3.6 as the most usable coding model I could actually run on the miniPC (a Strix Halo box: AMD Ryzen AI MAX+ 395, 128 GiB of unified memory, Vulkan). This is a MUCH shorter follow-up about squeezing more tokens/sec out of it with MTP, and about the ditch I drove into on the way.

The three-sentence version: turned on with its default settings, MTP made generation 3x slower! Tuned for this model, it’s about 18–26% faster. The difference between the two is a single number.

Contents

What the heck is MTP??
The wrong path
The knob that matters: --spec-draft-n-max
The config I'd actually use
What about the faster IQ4_XS build?
This is a Strix Halo / Vulkan result
The takeaway

What the heck is MTP??

Normally a model generates one token per forward pass: run the whole network, get one token, repeat. That’s slow. Speculative decoding speeds it up by first having a small, fast predictor — something far cheaper to run than the full model — guess the next several tokens, and then letting the full model verify that whole batch of guesses in a single forward pass. The trick is that checking several tokens at once costs the big model about the same as generating one. Every guess it accepts is a token you got essentially for free. The catch, of course, is that every rejected guess is wasted compute, both the draft’s and the verify’s. The math only pays off if the guesses are accepted often.
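The draft-then-verify loop described above can be sketched with toy stand-ins. This is a minimal illustration, not llama.cpp's implementation: `target_next`, `draft_next`, and the vocabulary size are hypothetical, and it uses exact-match (greedy) verification, whereas real implementations may use a probabilistic acceptance rule when sampling. The per-position calls inside the verify step stand in for what a real model computes in one batched forward pass.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (hypothetical)


def target_next(context: List[int]) -> int:
    """Stand-in for the full model: a deterministic toy rule, not a real LLM."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_next(context: List[int]) -> int:
    """Stand-in for the cheap predictor: usually right, deliberately wrong
    whenever the context length is a multiple of 5."""
    guess = target_next(context)
    return (guess + 1) % VOCAB if len(context) % 5 == 0 else guess


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one full-model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt: List[int], n_new: int, n_draft: int) -> Tuple[List[int], int]:
    """Draft n_draft tokens, then verify them; returns (tokens, verify passes)."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1) Draft cheaply, each guess conditioned on the previous guesses.
        drafted: List[int] = []
        for _ in range(n_draft):
            drafted.append(draft_next(out + drafted))
        # 2) Verify. Counted as ONE pass: a real model scores all drafted
        #    positions together in a single batched forward pass.
        verify_passes += 1
        accepted: List[int] = []
        for tok in drafted:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the correction, drop the rest
                break
            accepted.append(tok)
        else:
            # Every draft accepted: the verify pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_new], verify_passes


prompt = [1, 2, 3]
reference = greedy_decode(prompt, 40)
tokens, passes = speculative_decode(prompt, 40, n_draft=3)
print(tokens == reference, passes)
```

The output matches plain greedy decoding token for token. The only thing that changes is how many full-model passes it took to get there, and that saving is what acceptance rate controls.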

MTP (Multi-Token Prediction) is the self-speculative version: instead of running a second small “draft” model alongside the big one, the model ships with an extra lightweight head trained to predict a few tokens ahead. The draft and the verify come from the same model. llama.cpp added support in PR #22673 (merged 2026-05-16, build b9180 or later), exposed as --spec-type draft-mtp.

Qwen3.6 here is a 35B-A3B Mixture-of-Experts model: 35B total parameters, but only ~3B are active per token. That “A3B” part turns out to matter a lot for whether MTP helps.

The wrong path

The first thing I did, which was the first thing and not the smart thing, was the naive run: flip MTP on, leave everything at defaults, and see what stock settings buy you. It’s what most people will reach for, and the articles all quote ~2x, so why not? Generation dropped to 18.6 tok/s, down from ~60 with MTP off. So, the opposite of 2x. Definitely didn’t seem right lol.

The culprit was --spec-draft-n-max, the number of tokens the head is allowed to draft ahead before the model checks its work. It defaults to 16. Here’s what it gets you at a range of values (tok/s, higher is better):

MTP config (qwen3.6-35B-A3B, Unsloth Q4_K_M — early run)	tok/s
off (baseline)	59.7
on, `--spec-draft-n-max 16` (the default)	18.6
on, `--spec-draft-n-max 8`	27.6
on, `--spec-draft-n-max 4`	66.9

A caveat on these numbers: they were quick and exploratory. I killed off any stray model servers but skipped the full services-down, fans-pinned protocol I used for the recommended config below, so read the absolute baseline loosely. (It also idles at ~60 here rather than the ~65 you’ll see later because this early run used a slightly slower quant upload. Why two “Q4_K_M” files run at different speeds is its own rabbit hole, maybe a future companion post.) A 3x regression dwarfs either effect, which is the whole point.

So there we have it: for me, the default was a roughly 3x regression, and the line between useless and useful is narrow. n=8 is still slower than no MTP at all, and only by n=4 does it pull ahead. I almost wrote the whole thing up as “MTP doesn’t work on Strix Halo MoE” right there. Then I did the thing I should’ve led with if I’d actually wanted good numbers: went and read how MTP works.
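For reference, the arithmetic behind "3x regression" and "only n=4 pulls ahead" uses nothing but the numbers in the table above. The variable names are mine.

```python
BASELINE_TOK_S = 59.7  # MTP off, from the early-run table

# --spec-draft-n-max value -> measured tok/s, from the same table
EARLY_RUNS = {16: 18.6, 8: 27.6, 4: 66.9}


def relative_speed(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Speed as a multiple of the MTP-off baseline (1.0 means no change)."""
    return tok_s / baseline


for n_max, tok_s in EARLY_RUNS.items():
    rel = relative_speed(tok_s)
    verdict = "faster" if rel > 1.0 else "slower"
    print(f"n_max={n_max:>2}: {tok_s:5.1f} tok/s, {rel:.2f}x baseline ({verdict})")
```

That works out to about 0.31x of baseline at n=16 (roughly 3.2x slower), 0.46x at n=8, and 1.12x at n=4.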

The knob that matters: --spec-draft-n-max

The default of 16 is calibrated for dense, instruction-tuned models, which accept long draft runs. An A3B MoE is the opposite: only ~3B parameters fire per token, the head’s predictions get rejected sooner, and every rejected draft past the acceptance point is pure waste. The community guidance for A3B-class MoEs converges on n=2 or 3. Even llama.cpp’s own MTP pull request reports its best results around 3 draft tokens at roughly 75% steady-state acceptance, nowhere near the default 16, and acceptance only falls faster as you push past that.
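A rough way to see why long drafts lose is the standard back-of-envelope model for speculative decoding. It assumes every drafted token is accepted independently with the same probability, which is a simplification (acceptance on real text is not constant). The 0.75 acceptance figure comes from the PR numbers quoted above; the two cost parameters are made-up illustrative values, not measurements from this box.

```python
def expected_tokens_per_cycle(accept: float, n_draft: int) -> float:
    """Expected tokens emitted per draft+verify cycle.

    With i.i.d. acceptance probability `accept`, the cycle yields
    1 + a + a^2 + ... + a^n tokens on average (geometric series).
    """
    if accept >= 1.0:
        return n_draft + 1.0
    return (1.0 - accept ** (n_draft + 1)) / (1.0 - accept)


def modeled_speedup(accept: float, n_draft: int,
                    draft_cost: float, verify_cost_per_token: float) -> float:
    """Tokens per unit of work, relative to plain decoding (1 token per pass).

    Costs are in units of one ordinary forward pass. Both cost parameters
    are ASSUMPTIONS for illustration, not measured values.
    """
    cycle_cost = 1.0 + n_draft * (draft_cost + verify_cost_per_token)
    return expected_tokens_per_cycle(accept, n_draft) / cycle_cost


ACCEPT = 0.75        # roughly the steady-state acceptance quoted from the PR
DRAFT_COST = 0.10    # assumed: each drafted token costs 10% of a full pass
VERIFY_EXTRA = 0.05  # assumed: each extra verified token adds 5% to the pass

for n in (1, 2, 3, 4, 8, 16):
    print(n, round(modeled_speedup(ACCEPT, n, DRAFT_COST, VERIFY_EXTRA), 2))

best_n = max(range(1, 17),
             key=lambda n: modeled_speedup(ACCEPT, n, DRAFT_COST, VERIFY_EXTRA))
```

Under these assumptions the expected tokens per cycle flatten out quickly (the geometric series converges), while the cost keeps growing linearly with the draft length, so the best value sits at a small n. The toy model will not reproduce the measured 3x regression, which suggests real per-token costs or acceptance decay are harsher than the placeholders here.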

The other thing worth saying plainly: the reported 2x speedup you may have seen for “Qwen3.6 + MTP” comes from other setups (the PR author quotes >2x, testing on a different stack than a power-capped Strix Halo). On this box, for the 35B-A3B MoE on Vulkan, the realistic ceiling I measured is more like 1.2x to 1.3x (about 1.18x at Q4_K_M and 1.26x at Q8_0). Lower acceptance means less free speed. That’s not a failure, it’s just the honest number for this class of model on this hardware, and it’s still worth having.

There’s also a second knob, --spec-draft-p-min (the minimum probability the head needs before a draft is even attempted, default 0.75). Some guides call it the most impactful parameter. On my hardware, sweeping it from 0.5 to 0.9 stayed within sampling noise, so I left it at the default. Your mileage may vary; it’s worth a quick sweep, but n_max is the one that actually moved my numbers.

The config I'd actually use

Two setups: pick by whether you care more about disk or about the last drop of quality. Both are gfx1151 on Vulkan (see the caveat at the end if you’re on something else), same clean-bench methodology as the last post.

The build. You need llama.cpp with PR #22673, so b9180 or later. Check with llama-server --help and look for draft-mtp in the --spec-type modes. Build it Vulkan-only: there’s a known foot-gun (issue #23199) where, in a dual Vulkan+ROCm build, the MTP tensors get placed on the ROCm device and MTP is silently disabled even when you asked for Vulkan. I didn’t run into this thankfully (my build is Vulkan-only), but watch out.

Q4_K_M (smaller, faster, what I’d default to). Two files from bartowski/Qwen_Qwen3.6-35B-A3B-GGUF: the model (Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf, ~20 GiB) and the MTP head as a separate file (mtp-Qwen_Qwen3.6-35B-A3B-Q4_0.gguf, ~1 GiB).

llama-server \
  --model Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --model-draft mtp-Qwen_Qwen3.6-35B-A3B-Q4_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 -c 8192 -fa on --parallel 1 \
  -t 32 -tb 32 -ub 2048 \
  -ctk q8_0 -ctv q8_0 \
  --reasoning off \
  --host 0.0.0.0 --port 8089 --alias qwen3.6

That gets ~77 tok/s, versus ~65 with the --model-draft and --spec-type lines removed. About +18%.

Q8_0 (more disk, a hair more quality). Same flags, the Q8_0 trunk and head (Qwen_Qwen3.6-35B-A3B-Q8_0.gguf ~35 GiB, mtp-Qwen_Qwen3.6-35B-A3B-Q8_0.gguf ~2 GiB), and one important change: --spec-draft-n-max 2 instead of 3. Expected: ~63 tok/s, versus ~50 with MTP off. About +26%.

The sweet spot drops to 2 at Q8 because the trunk’s own predictions are more confident, so it accepts fewer drafts from the head, and drafting a third token just wastes work. Sweep n_max (1, 2, 3, 4) on whatever model and quant combo you land on and use whichever wins.
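The sweep itself is easy to script. Here is a minimal sketch that only builds the command lines and picks a winner from numbers you measure yourself. It does not launch anything. The flags are a subset of the ones in the config above, and the helper names are hypothetical.

```python
import shlex
from typing import Dict, Iterable, List

# A subset of the flags from the Q4_K_M config above; adjust paths to taste.
BASE_ARGS: List[str] = [
    "llama-server",
    "--model", "Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf",
    "--model-draft", "mtp-Qwen_Qwen3.6-35B-A3B-Q4_0.gguf",
    "--spec-type", "draft-mtp",
    "-ngl", "999", "-c", "8192", "-fa", "on", "--parallel", "1",
    "-ctk", "q8_0", "-ctv", "q8_0",
    "--port", "8089",
]


def sweep_commands(n_values: Iterable[int] = (1, 2, 3, 4)) -> Dict[int, List[str]]:
    """One argv list per --spec-draft-n-max value to try."""
    return {n: BASE_ARGS + ["--spec-draft-n-max", str(n)] for n in n_values}


def pick_winner(measured_tok_s: Dict[int, float]) -> int:
    """Return the n_max with the highest measured tok/s."""
    return max(measured_tok_s, key=measured_tok_s.get)


for n, argv in sweep_commands().items():
    print(shlex.join(argv))
```

Run each command, benchmark generation the same way every time, and feed the results to `pick_winner`. With the early-run numbers from this post, `pick_winner({16: 18.6, 8: 27.6, 4: 66.9})` returns 4.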

The head quant matters less than you’d think, and lighter is better. A Q4_0 head on the Q4_K_M trunk hit 77 tok/s; a Q8_0 head on the same trunk did 74. The bigger head’s better predictions don’t pay for the extra bandwidth they cost.

Configs that lose, for completeness: the default --spec-draft-n-max 16 (~19 tok/s, a 3x regression) and --spec-draft-n-max 8 (~28 tok/s, still slower than no MTP). The default is the trap.

If you’d rather not juggle two files, unsloth/Qwen3.6-35B-A3B-MTP-GGUF bundles the trunk and head into one (drop --model-draft, keep --spec-type draft-mtp). I measured ~75 tok/s at n=3, about 2 tok/s behind the bartowski split, with identical quality on Polyglot, gsm8k, and ifeval. Use whichever workflow you prefer.

What about the faster IQ4_XS build?

There’s an IQ4_XS build with the MTP head baked in, and a pretty wild-sounding claim for IQ4_XS+MTP on Strix Halo was making the rounds: 90.8 tok/s average, 110.6 peak. IQ4_XS is a smaller quant than Q4_K_M (about 4.25 bits per weight versus ~4.8), and since generation is bandwidth-bound, smaller CAN mean faster, so it’s a plausible claim. I downloaded it and benched it the same way as everything above, but I was not able to reproduce it.

IQ4_XS + MTP (n=2), my box	tok/s
q8_0 KV cache	~78
f16 KV cache	~81
the number I was chasing	90.8 avg / 110.6 peak

Apples-to-apples, at the q8_0 KV cache my recommended config uses, IQ4_XS lands around 78 tok/s, a hair over the Q4_K_M + MTP setup above (~77) but inside the noise. Switching IQ4_XS to an f16 KV cache pushes it to ~81. That is a real few percent, but it is the cache talking, not the quant: f16 would lift the Q4_K_M numbers the same way. So IQ4_XS earns you a little (a smaller file), and a little more if you spend the extra memory on an f16 cache, but it is not the different league the 90+ figure implies. A raw llama-bench pass on the file came in at 73, so that is not where the headline comes from either.
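The "smaller can mean faster" reasoning can be put into rough numbers. This sketch assumes generation is purely weight-bandwidth-bound and that every active weight is stored at the quant's nominal bits per weight. It ignores the MTP head, KV cache reads, and tensors kept at higher precision, so treat the result as an upper bound rather than a prediction.

```python
def bytes_read_per_token(active_params: float, bits_per_weight: float) -> float:
    """Approximate weight bytes streamed from memory to generate one token."""
    return active_params * bits_per_weight / 8


def ideal_speed_ratio(bpw_small: float, bpw_large: float) -> float:
    """Best-case speedup of the smaller quant if only weight bandwidth mattered."""
    return bpw_large / bpw_small


ACTIVE_PARAMS = 3e9   # the "A3B" part: ~3B parameters active per token
IQ4_XS_BPW = 4.25     # figures quoted in the paragraph above
Q4_K_M_BPW = 4.8

ratio = ideal_speed_ratio(IQ4_XS_BPW, Q4_K_M_BPW)
print(f"ideal IQ4_XS speedup over Q4_K_M: {ratio:.2f}x")
print(f"Q4_K_M weight traffic: {bytes_read_per_token(ACTIVE_PARAMS, Q4_K_M_BPW) / 1e9:.2f} GB/token")
```

The ideal ratio comes out around 1.13x, so the smaller quant alone can plausibly account for a gain of up to about 13%. The measured step from ~77 to ~78 tok/s is well under that bound.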

My best guess for the gap to 90.8: I cap this box at 100W with ryzenadj for round-the-clock thermal stability (the “I melted it twice” saga from the last post). Run the chip hotter, or on a newer build, and you would probably claw some of it back. Worth a shot if you have the thermal headroom. But on a power-limited Strix Halo it is a modest step over the Q4_K_M + MTP config, not a leap, and nothing I measured got close to 90. There are some thermal upgrades I may attempt to make, and maybe I’ll revisit this with a higher cap if I do.

This is a Strix Halo / Vulkan result

The numbers above are gfx1151 on Vulkan. On CUDA, an early community writeup for this same model found no net speedup from llama.cpp’s speculative-decoding paths: it tested 19 configurations on an RTX 3090 and found none faster than baseline (the same author’s HackMD notes lay out the detail). Note that this is a llama.cpp-specific result: the same author found vLLM’s MTP faster on the same card. So the speedup here may be specific to the Vulkan path or to the unified-memory architecture. If you’re on a 3090 or an A100, don’t expect these numbers, and if you measure your own, publish them: there’s a bit of public A3B-MoE spec-decode data now (mostly CUDA, plus a Strix Halo ROCm run or two), but I couldn’t find a single A3B-on-gfx1151-via-Vulkan benchmark out there, so that corner is wide open.

The takeaway

MTP on this hardware is a real but modest gain: ~18–26%, not the 2x you’ll see quoted for the dense model. And the default config is actively harmful on an A3B MoE. If you take one thing from this: turn --spec-draft-n-max down to 2 or 3 before you decide whether MTP works for you.

Strix Halo vs an A30 vs the Frontier: What the miniPC Can (and Can't) Actually Do

Sat, 23 May 2026 12:00:00 GMT

A few months ago I wrote about getting a Strix Halo miniPC (~$2.5K all-in) to run a frontier coding model. That led to people asking me how I was measuring/testing models (I really wasn't, beyond what my anecdotal experience was). That led to me starting to actually benchmark stuff - but then I thought... why not throw my A30 GPU in the mix while I'm at it, and really see what's what? They even cost me (roughly) the same.

The short version: I didn't lie! It is roughly like Sonnet when it's a bit slow. BUT, things have changed and now you should use qwen3.6. Also there are some other neat findings ahead ;)

The longer version: quality is hardware-agnostic, which is exactly what you'd expect (run the same model on a Strix Halo APU or a datacenter A30 and you get the same benchmark score within noise, because the hardware should decide how fast, not how smart). The interesting questions were always about speed and fit, and that is where the miniPC shines!

Locally-hosted qwen3.6-thinking (a 35B-A3B Mixture-of-Experts (MoE) model, that is, 35B total parameters with only ~3B activated per token) scores 62.2% on the Aider Polyglot benchmark, sitting right between Claude Sonnet 4 thinking (61.3%) and Claude 3.7 Sonnet thinking (64.9%), and it does that at 45 tok/s with a 65K-token context. That is Sonnet-class capability, at usable speeds, on a ~$2.5K box. It is not Opus-class (72.0%) or GPT-5-class (88.0%), but Sonnet-class coding on hardware I own, at no per-token cost and with nothing leaving my network.

The surprises were both about memory, not compute. First: even when a model fits comfortably in the A30's VRAM, the A30 can still lose at long context. The A30 wins at short context, but on qwen3-30b-a3b at 65K depth the miniPC outruns it by 36% (28.1 vs 20.6 tok/s), because the A30's 24 GiB runs out of room for the KV cache (the attention key/value store that grows with every token of context) while the miniPC's 128 GiB of unified memory shrugs it off (the crossover lands somewhere between 8K and 32K). qwen3-30b-a3b is only 17 GiB, so this isn't even an offload problem: everything runs on the GPU, and the A30 just can't hold a big model and a big KV cache at once.

The bigger surprise was the hybrid cliff. For any model that doesn't fit in 24 GiB at all, the miniPC wins outright. qwen3-coder-next (80B-A3B) has to be split across GPU and CPU on the A30, and once half its layers (plus their KV) live in slow DDR4 system memory and get streamed over PCIe every token, it crawls: 3.9 tok/s at 65K against Strix Halo's 38, almost 10x. That gap, a unified-memory box versus a GPU forced into hybrid offload, is the whole reason I bought the thing.

The recommendation has also moved since the last post. Back then, qwen3-coder-next was the best local-fitting coding model I had, and I called its quality "something like Sonnet 4.5 on a slow day." That was ... close enough that I won't say I was wrong, hah (it lands about 9pp below baseline Sonnet 4), but qwen3.6 has since released, and on the same Strix Halo hardware it is straightforwardly better: higher quality, faster, and a smaller VRAM footprint. If you have been running qwen3-coder-next since my last post, switch to qwen3.6.

Contents

The contestants
The benchmarks
Quality: Strix Halo vs A30
Quality: Local vs the Frontier
Coding-specific: Aider Polyglot
Saturated benchmarks: HumanEval+ and lm_eval
Quality at depth: NIAH
Speed: default throughput
Speed at long context
gpt-oss-20b (fits both)
qwen3-30b-a3b-2507 (fits both)
qwen3.6-thinking (Strix Halo only, Sonnet-tier model)
qwen3-coder-next 80B-A3B (Strix Halo UMA vs A30 hybrid offload)
DeepSeek-Coder-V2-Lite, the bonus weird result
Bonus models (Strix Halo coverage only)
What each box is actually best for
What I'd do differently
The end result

The contestants

Spec	Strix Halo (the miniPC)	A30
GPU/APU	AMD Ryzen AI MAX+ 395 (gfx1151)	NVIDIA A30 PCIe
Architecture	RDNA3.5-class iGPU, unified memory (UMA)	Ampere GA100, dedicated PCIe
Memory	128 GiB LPDDR5X (shared CPU+GPU)	24 GiB HBM2
Backend	llama.cpp Vulkan (RADV)	llama.cpp CUDA
TDP (effective)	100W (after `ryzenadj`)	165W
Host system	GMKtec NucBox EVO-X2	R740 / Proxmox VM (16c/64G)
Approx all-in cost	~$2,500	~$2-3K used (card only; bring your own server)
Largest model that fits	gpt-oss-120B (~64 GiB), qwen3-coder-next 80B-A3B (~45 GiB)	qwen2.5-coder-32B (Q4_K_M) at most

The "what fits" row is the most important and most under-discussed difference, with one clarification: it means fits entirely in VRAM, at full speed. You can push a bigger model onto the A30 with hybrid GPU/CPU offload (I do exactly that later in the post), it just runs much slower once part of it spills out of the 24 GiB. The miniPC's unified memory has no such cliff: anything up to 128 GiB loads and runs on the iGPU at full speed, including a 120B model the A30 can't hold in VRAM at all.

It's worth noting what's doing the spilling on the A30 side: its host is an older Dell R740 on DDR4, so when a model overflows into system RAM, DDR4 bandwidth is the bottleneck. A newer DDR5 host with more memory channels would lift the hybrid numbers, but it would also cost more, which is sort of the point: the unified-memory miniPC sidesteps the spill entirely, for the price of a single mid-range box.

The benchmarks

Four things, each measuring something different:

Aider Polyglot: 225 multi-language Exercism coding exercises where the model is asked to edit existing files to make tests pass. This is the only benchmark on the list that resembles real-world agentic coding work, and it's the one frontier models actually struggle with. Not saturated.
HumanEval+: function-level code generation, 164 problems. Top models all score 90%+. Saturated.
lm_eval (gsm8k, ifeval, mmlu_pro): knowledge and instruction-following at single-prompt level. Frontier models saturate this too.
llama-bench: pure throughput, no quality signal. Two numbers matter, both in tokens/sec: pp (prompt processing, how fast the model ingests your prompt) and tg (token generation, how fast it writes the reply). I report the defaults pp512 / tg128 (a 512-token prompt, 128 generated tokens) plus depth tests for long-context behavior.

I treat Polyglot as the load-bearing quality metric because (a) it actually discriminates, and (b) it's what I care about: agentic coding is what these boxes get used for in practice. If there's a benchmark I didn't run, it is because I don't know about it or didn't think of it.

One reading convention for every table below: higher is better, unless I explicitly say otherwise.

Quality: Strix Halo vs A30

Nobody expects a model to get smarter or dumber depending on whether it runs on AMD or NVIDIA silicon, and it doesn't. This was never really an open question. But I had both boxes and was running the benchmarks anyway, so I figured I'd confirm it and see where any real differences showed up. The answer: quality is the same within noise, with one model-specific surprise. Boring but necessary setup for everything that follows.

Model	Strix Halo (Vulkan)	A30 (CUDA)
gemma-3-27b-it (HumanEval+ p@1+)	78.7%	77.4%
qwen2.5-coder-32b (HumanEval+ p@1+)	85.4%	86.6%
qwen3-30b-a3b-2507 (HumanEval+ p@1+)	89.0%	89.6%
qwen2.5-coder-32b (Polyglot)	25/225 (11.1%)	25/225 (11.1%)
qwen3.6 (Polyglot, no think)	121/225 (53.8%)	106/225 (47.1%)

("p@1+" is pass@1 on EvalPlus's extended test set, meaning the model's first answer has to pass every test.)

Within noise on HumanEval+ across the board, and on the qwen2.5-coder Polyglot row. The qwen3.6 Polyglot row shows a 6.7pp cross-host gap (53.8% Strix Halo vs 47.1% A30), which is larger than I'd expect from pure sampling noise; possibly a real CUDA-vs-Vulkan difference for that specific model and harness, or a build-version skew between the two boxes. The HumanEval+ gemma/qwen2.5/qwen3-30b rows on the same model files agree cross-host to within about 1.3pp, and the qwen2.5-coder Polyglot row matches exactly, so it isn't a general "the A30 produces worse logits" pattern; it's a qwen3.6-Polyglot-specific finding I'd want to dig into in a future bench.

So the model is mostly the model, and hardware doesn't make it dumber in any general sense. There can be model-specific cross-host quirks worth checking (this one came as a surprise to me), but for the typical case, once you've picked a model that fits, the hardware question reduces to how fast and can it even fit.

Quality: Local vs the Frontier

Here's where it gets fun. I'm going to split this into coding and non-coding because they behave very differently.

Coding-specific: Aider Polyglot

Polyglot is the benchmark where frontier models still have headroom, and the one that tracks "how good is this thing as a coding agent." Here's the comparison (Aider leaderboard scores for the API models, my results for local):

Model	Polyglot pass rate	Notes
GPT-5 (high)	88.0%	API
Gemini-2.5-Pro (32k think)	83.1%	API
DeepSeek-V3.2 Reasoner	74.2%	API (open weight ~700B, won't fit my hardware)
Claude Opus 4 (32k think)	72.0%	API
Claude Opus 4 (no think)	70.7%	API
Claude 3.7 Sonnet (32k think)	64.9%	API
qwen3.6-thinking (Strix Halo)	62.2%	local, 35B-A3B MoE
Claude Sonnet 4 (32k think)	61.3%	API
Claude 3.7 Sonnet (no think)	60.4%	API
Claude Sonnet 4 (no think)	56.4%	API
qwen3.6 (Strix Halo, no think)	53.8%	local
qwen3-coder-next (Strix Halo)	47.6%	local, 80B-A3B MoE; doesn't fit on A30, see the Speed sections
qwen3.6 (A30, no think)	47.1%	local
GPT-OSS-120B (high)	41.8%	leaderboard score, API
qwen3-30b-a3b-2507 (Strix Halo)	30.2%	local
qwen3-30b-a3b-2507 (A30)	28.9%	local
GPT-OSS-20B-thinking (A30)	16.9%	local
GPT-OSS-120B (Strix Halo, Q4_K_M)	1.8%*	local, almost certainly broken locally, see footnote

* The 23× gap between local gpt-oss-120B (1.8%) and the same model's API leaderboard score (41.8%) is almost certainly the reasoning_effort parameter not wiring through to llama.cpp's gpt-oss path: low/medium/high produce near-identical outputs within sampling noise. For a model whose top-line capability is its reasoning depth, a broken reasoning knob is a broken model. Full discussion in the "Local quants underperform" point below.

(All API model scores in this table come from the Aider Polyglot leaderboard, last updated 2025-11-20. A few newer frontier releases (Google's Gemini 3 and Anthropic's Claude Opus 4.5 / 4.7) exist but haven't been scored by the Aider team yet, so they aren't represented above. The most recent Gemini and Opus variants the leaderboard does have are Gemini 2.5 Pro 32k-think at 83.1% and Claude Opus 4 32k-think at 72.0%.)

What this shows:

Sonnet-class is achievable locally, in both thinking and no-think modes. My best local model (qwen3.6-thinking, a 35B-A3B MoE) sits right in the Claude Sonnet thinking band (62.2% vs Sonnet 4 thinking 61.3%). And on the apples-to-apples no-think comparison, qwen3.6 with thinking off (53.8%) is just 2.6pp under Claude Sonnet 4 no-think (56.4%); effectively tied within Polyglot's noise floor. So it's not just "Sonnet-class when allowed to think"; it's "Sonnet-class without needing to think." That second result was the bigger surprise.
The recommendation has moved since my last post. Back then, qwen3-coder-next (80B-A3B) was the best local-fitting coding model I had, and the explicit subject of the previous post. qwen3.6 didn't exist yet. Now it does, and it's straightforwardly better: 53.8% Polyglot at thinking-off (vs qwen3-coder-next's 47.6%), 62.2% at thinking-on, smaller VRAM footprint, faster throughput. If you've been running qwen3-coder-next on Strix Halo since my last post: try qwen3.6.
The real gap is to GPT-5, Gemini 2.5 Pro, and Claude Opus. Those three are ~10-26pp ahead of my best local model. The Anthropic ladder is worth calling out specifically: qwen3.6-thinking (62.2%) is essentially tied with Sonnet 4 thinking (61.3%), but Anthropic's actual flagship is Opus, which scores 72.0%, about 10pp ahead of local. Then GPT-5 (88.0%) and Gemini 2.5 Pro thinking (83.1%) are the real top of the leaderboard. DeepSeek V3.2 Reasoner (74.2%) is the closest open-weight to that band, but at ~700B parameters it won't fit on either of my boxes.
Local quants underperform their API counterparts catastrophically on some models. My local gpt-oss-120B Q4_K_M scored 1.8%; the leaderboard's gpt-oss-120b (high) scored 41.8%. That's a 23x gap, not a small one. Three things contribute: quantization; the reasoning_effort parameter, which doesn't actually wire through to the model on llama.cpp (I verified this; low/medium/high produce near-identical outputs within sampling noise); and the edit format, since I used Aider's whole format vs the leaderboard's diff. The reasoning-effort issue is probably the biggest factor; gpt-oss is essentially a reasoning model, and if the reasoning depth knob is broken, the model is operating in something close to a "low effort" mode regardless of what you pass in.
Thinking mode is meaningful when measured correctly. qwen3.6 without thinking: 53.8%. With thinking: 62.2%. That's 8.4 pp of capability sitting behind a flag.

A caveat about polyglot versioning

My test harness ran 225 exercises on most models, the same set as Aider's leaderboard. A few runs got 289 or 450 (multiple attempts per exercise from a config tweak); rates are still computed as passed/total. Edit-format matters too: I used whole because it's more robust to weaker models, while Aider's leaderboard uses diff because it gets better scores from the top models. whole is generally a slight handicap. Treat the comparisons as directional, not exact.

Methodology note: all HumanEval+ numbers in this post come from evalplus.codegen, the canonical scorer behind EvalPlus's published leaderboard.

Saturated benchmarks: HumanEval+ and lm_eval

These are the benches where frontier and local models all score in the same 85-95% range, so they don't discriminate well anymore. Quick look:

Cross-host gsm8k + ifeval, identical Q4_K_M quantization, identical chat-completions API, 200 items each:

Model	Strix Halo gsm8k	A30 gsm8k	Strix Halo ifeval	A30 ifeval
gemma-3-4b-it	78.5%	82.5%	69.5%	68.0%
qwen3-4b-2507	90.0%	90.0%	79.5%	80.5%
gemma-3-12b-it	92.5%	91.0%	72.0%	72.5%
gemma-3-27b-it	93.5%	93.5%	77.5%	76.0%
qwen3-30b-a3b-2507	95.5%	93.5%	79.5%	80.5%
qwen2.5-coder-32b	95.0%	95.0%	75.0%	75.0%
phi-4 (14B)	90.5%	91.0%	56.5%	57.0%
mistral-small-3.2-24b	95.0%	94.5%	76.5%	72.0%
qwen3.6-thinking	96.5%	n/a (model not on A30)	79.0%	n/a
gpt-oss-20b (reasoning off)	87.5%	n/a (different run config)	25.5%	n/a

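Most cross-host cells agree within 1–2 percentage points. The two widest gaps are gemma-3-4b-it on gsm8k (4.0pp) and mistral-small-3.2-24b on ifeval (4.5pp), which is still about what a 200-item sample can produce. Same model, same quantization, same prompt, same score within noise.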
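How much wobble should 200 items produce? Here is a quick estimate, assuming each item is an independent pass/fail trial and treating the two hosts as independent samples. Both assumptions are simplifications: the hosts ran the same items, so the real noise may differ.

```python
import math


def standard_error(pass_rate: float, n_items: int) -> float:
    """Binomial standard error of a pass rate measured on n_items."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_items)


def gap_in_sigmas(rate_a: float, rate_b: float, n_items: int) -> float:
    """Cross-host gap expressed in standard errors of the difference."""
    pooled = (rate_a + rate_b) / 2
    se_diff = math.sqrt(2) * standard_error(pooled, n_items)
    return abs(rate_a - rate_b) / se_diff


N_ITEMS = 200  # items per benchmark in the table above

print(f"SE at 80% on {N_ITEMS} items: {standard_error(0.80, N_ITEMS) * 100:.1f} pp")
# The two widest cross-host gaps in the table:
print(f"gemma-3-4b-it gsm8k:          {gap_in_sigmas(0.785, 0.825, N_ITEMS):.2f} sigma")
print(f"mistral-small-3.2-24b ifeval: {gap_in_sigmas(0.765, 0.720, N_ITEMS):.2f} sigma")
```

A single 200-item score near 80% carries a standard error of about 2.8pp, and even the widest gaps in the table land at roughly one standard error of the difference.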

Notable observations:

qwen3.6-thinking tops gsm8k at 96.5%, better than any A30 result in this set, and in the same ~95–97% band frontier models hit on saturated math benches before the frontier moved on to AIME / FrontierMath. On a 35B-A3B MoE running on a miniPC.
gpt-oss-20b ifeval at 25.5% is shockingly low for a model that hits 87.5% on gsm8k in the same run. This is the --reasoning off configuration. The other gpt-oss-20b runs in my data, reasoning on variants, also fall in the 25–36% ifeval band, so this isn't a reasoning-flag artifact; gpt-oss-20b just struggles with strict prompt-following regardless. Worth knowing if you were planning to deploy it for instruction-bound tasks.
phi-4 inverted profile: 90.5% gsm8k but only 56.5% ifeval. It's a math-strong, instruction-weaker model. Useful data point for choosing models by use case.

Same story as HumanEval+: local models are basically tied with frontier on saturated benchmarks. A Qwen3-30B-A3B on a miniPC scores 95.5% on gsm8k, comfortably in the same band as any frontier model that's still being measured against gsm8k. The frontier moat still exists, but it's on real-world agentic coding (Polyglot).

Quality at depth: NIAH

Throughput at 65K context is meaningless if the model can't actually find anything at 65K. I tested needle-in-haystack retrieval (single-needle: ask "what's the best thing to do in San Francisco?" with a sandwich-and-Dolores-Park needle planted at 10%, 50%, or 90% depth in a haystack of Paul Graham essays):

Model	Host	Pass rate (4K/16K/32K/60K × 3 depths)
qwen3.6-thinking	Strix Halo	100% (12/12)
qwen3-coder-next	Strix Halo	100% (12/12)
qwen3-30b-a3b-2507	A30	100% (12/12)
qwen2.5-coder-32b	A30	100% (12/12)
gemma-3-27b-it	A30	100% (12/12)
phi-4	A30	100% (12/12)
mistral-small-3.2-24b	A30	100% (12/12)
llama-4-scout-17b-16e	A30	100% (12/12)
qwen3-4b-2507	A30	100% (12/12)
gpt-oss-20b	A30	91.7% (11/12)
granite-3.1-8b-instruct	A30	88.9% (failed at depth)
deepseek-coder-v2-lite	A30	33.3% (4/12, all 4K passes, 1 of 3 at 16K, every 32K and 60K cell timed out at 600s), same root cause as the llama-bench MLA cliff: CUDA-on-MLA is too slow at depth to finish a single query inside any reasonable budget

Top-row finding: Strix Halo's qwen3.6-thinking and qwen3-coder-next both score perfect retrieval at 60K context, with response times of 1-2 min per query. The model isn't just running with that context, it's actually using it. Combined with the throughput numbers, this is what makes the miniPC a real coding-agent target rather than a benchmark curiosity.
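For anyone who wants to rebuild the test, here is a minimal sketch of how a single-needle prompt can be assembled and scored. The needle sentence is the commonly used wording for this test; the post only describes it as a sandwich-and-Dolores-Park needle, so the exact text, the helper names, and the keyword-based pass check are all assumptions.

```python
from typing import List

# Assumed wording of the needle and question (not confirmed by the post).
NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = "What's the best thing to do in San Francisco?"


def build_haystack(filler_sentences: List[str], needle: str, depth_fraction: float) -> str:
    """Insert the needle at depth_fraction (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0 and 1")
    idx = round(len(filler_sentences) * depth_fraction)
    return " ".join(filler_sentences[:idx] + [needle] + filler_sentences[idx:])


def passed(answer: str) -> bool:
    """Crude scorer: the answer must mention both parts of the needle."""
    lowered = answer.lower()
    return "sandwich" in lowered and "dolores park" in lowered


# Placeholder filler; the real test uses essay text sized to the target context.
filler = [f"Filler sentence {i}." for i in range(100)]
for depth in (0.10, 0.50, 0.90):
    prompt = build_haystack(filler, NEEDLE, depth) + "\n\n" + QUESTION
    print(depth, len(prompt))
```

Four context sizes times three depths gives the 12 cells per model in the table above.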

Speed: default throughput

Quality matters; speed matters more than people think. A 62% model running at 1 tok/s is unusable. A 50% model at 80 tok/s is a daily driver.

(Methodology note before the tables: every Strix Halo throughput number below was collected with no other model servers running, fans pinned to max, and free memory verified before each run. There's a bench wrapper now that refuses to start without those conditions met. I ended up writing it after melting the poor machine twice, details in What I'd do differently below.)

Default pp512 / tg128 numbers (Q4_K_M, -fa 1, Strix Halo on q8_0 KV / A30 on q4_0 KV, see longctx section for the protocol note). Throughput is in tokens/sec, so higher is better. The last column is the one place bigger isn't better: it's the A30/miniPC tg ratio, where above 1.0 means the A30 is faster and below 1.0 means the miniPC wins (I flag those rows inline).

Model	Strix Halo pp	A30 pp	Strix Halo tg	A30 tg	A30/Strix Halo tg ratio
qwen3-4b-2507	2048	3934	75.0	118.9	1.59x
gemma-3-4b-it	2257	4431	74.1	109.5	1.48x
gemma-3-12b-it	750	1590	26.9	52.4	1.95x
phi-4 (14B)	652	1452	24.1	53.2	2.21x
gpt-oss-20b	1287	2805	80.8	130.2	1.61x
mistral-small-3.2-24b	267	905	15.3	34.5	2.26x
gemma-3-27b-it	230	771	12.6	28.0	2.23x
qwen3-30b-a3b-2507 (MoE)	1167	2274	87.0	136.2	1.56x
qwen2.5-coder-32b	186	633	11.1	24.1	2.18x
qwen3-coder-next (80B-A3B)†	551	110	56.4	12.2	0.22x ← Strix Halo wins 4.6x
qwen3.6 (35B-A3B MoE)	944	1933	67.1	99.9	1.49x

† The A30 row for qwen3-coder-next is hybrid GPU/CPU offload (22 of 49 layers on GPU, the rest on CPU/RAM). The 45 GiB Q4_K_M model can't fit fully in 24 GiB VRAM, so this is what you get if you force it onto the A30 anyway, the apples-to-apples speed cost of exceeding the VRAM ceiling on a dedicated GPU.
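The ratio column is just A30 tg divided by Strix Halo tg. Here is a small sketch using four rows from the table above (names are mine), in case you want to recompute it with your own numbers.

```python
from typing import Dict, Tuple

# model -> (Strix Halo tg, A30 tg), both in tok/s, from the table above
TG_TOK_S: Dict[str, Tuple[float, float]] = {
    "qwen3-4b-2507": (75.0, 118.9),
    "gpt-oss-20b": (80.8, 130.2),
    "qwen3-coder-next (80B-A3B)": (56.4, 12.2),
    "qwen3.6 (35B-A3B MoE)": (67.1, 99.9),
}


def a30_over_strix(strix_tg: float, a30_tg: float) -> float:
    """Above 1.0 the A30 is faster; below 1.0 the miniPC wins."""
    return a30_tg / strix_tg


for model, (strix_tg, a30_tg) in TG_TOK_S.items():
    ratio = a30_over_strix(strix_tg, a30_tg)
    winner = "A30" if ratio > 1.0 else "Strix Halo"
    print(f"{model:28s} {ratio:.2f}x  ({winner} faster)")
```

For the hybrid-offload row the ratio flips below 1.0, and inverting it (56.4 / 12.2) gives the 4.6x miniPC advantage flagged in the table.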

Two stories here:

1. A30 wins at default by roughly 1.5-2.3x on tg. Expected, a dedicated GPU with proper VRAM and CUDA kernels should beat an APU running Vulkan. The tg factor is consistent across the dense models from 12B up, in the 1.95-2.26x range (pp gaps run wider, up to about 3.4x).

2. MoE narrows the gap and makes the miniPC viable. Look at qwen3-30b-a3b-2507: A30/Strix Halo ratio is just 1.56x for tg, one of the smallest gaps in the table among the bigger models (qwen3.6, also an A3B MoE, is the smallest at 1.49x). That's because the model only activates ~3B params per token. Memory bandwidth matters more than raw compute for tg, and Strix Halo's UMA gives it surprisingly good bandwidth for active-parameter-light workloads. (The 4B models also show ratios below 2x, small models stop benefiting from the A30's compute headroom because they're already bandwidth-bound on both boxes.)

Compare that to the dense qwen2.5-coder-32b: 11.1 tok/s on Strix Halo vs 24.1 on A30, still a 2.18x gap but the absolute number is terrible on Strix Halo. I don't know about the rest of you, but 11 tok/s on a 32B dense model is not exactly what I'd call "usable". I'd never reach for the dense coder if a comparable-quality MoE exists.

Speed at long context

Now the fun part. Wait, I already said that. Another fun part! Coding agents send long context (the codebase, the test results, previous turns), so what happens when you push the depth?

I ran the same pp512 / tg128 test at depths 0 / 8K / 32K / 65K. Strix Halo is benched with q8_0 KV cache (matches how the production llama-servers are deployed). A30's previously-collected longctx sweep was at q4_0 KV; the small protocol asymmetry is mildly conservative for Strix Halo at depth (q4_0 saves a bit of KV bandwidth at the cost of dequant overhead, within MC noise on this hardware, but if anything it shaves a few percent off the Strix Halo side at deep contexts).

gpt-oss-20b (fits both)

Depth	Strix Halo pp	A30 pp	Strix Halo tg	A30 tg	A30/Strix Halo tg
0 (default)	1287	2805	80.8	130.2	1.61x
8K	958	2522	66.6	109.5	1.64x
32K	547	1933	56.9	77.2	1.36x
65K	338	1452	45.6	54.5	1.20x

A30 tg dropped 58% from default to 65K depth (130 to 55 tok/s). Strix Halo tg dropped 44% over the same range (81 to 46 tok/s). A30 still wins on this model at every depth, but the lead shrinks dramatically as context grows, the A30/Strix Halo ratio compresses from 1.61x at default to 1.20x at 65K.

qwen3-30b-a3b-2507 (fits both)

Depth	Strix Halo pp	A30 pp	Strix Halo tg	A30 tg	A30/Strix Halo tg
0 (default)	1167	2274	87.0	136.2	1.56x
8K	533	1746	62.1	72.5	1.17x
32K	205	1012	40.2	35.1	0.87x ← Strix Halo wins
65K	110	631	28.1	20.6	0.73x ← Strix Halo wins by 36%

This is where it gets spicy. A30 tg dropped 85% from default to 65K (136 to 21 tok/s) as the 24 GiB of VRAM ran out of room for a meaningful KV cache at depth. Strix Halo tg dropped 68% over the same range (87 to 28 tok/s), painful but consistent. The crossover happens between 8K and 32K context. At 32K the miniPC is already faster; at 65K it's 36% faster than the dedicated GPU.

The model itself is 17 GiB Q4_K_M. The A30 has 24 GiB of VRAM. At 65K context the KV cache plus activations plus the model are competing for that 7 GiB headroom, and CUDA's memory management gets bottlenecked. Strix Halo's 128 GiB UMA doesn't care: there's so much memory headroom that the only constraint is compute and bandwidth, both of which degrade gracefully.

qwen3.6-thinking (Strix Halo only, Sonnet-tier model)

This is the model I'd actually use for coding. The numbers are remarkable:

Depth	Strix Halo pp	Strix Halo tg
0 (default)	944	67.1
8K	790	61.8
32K	517	55.6
65K	349	45.5

tg drops 32% from default to 65K depth (67.1 to 45.5 tok/s). A Sonnet-class model running locally at 45 tok/s with a 65K-token context window. That's actually usable for serious agentic coding, you can pack a meaningful chunk of a codebase into the context and not pay a brutal speed tax for it.

A note on Q8_0: I also ran the no-think qwen3.6 at Q8_0 (38 GiB on disk vs Q4_K_M's 20 GiB). Polyglot moved from 53.8% to 56.9%, a ~3 pp gain. Throughput dropped from 65 tok/s to 50 tok/s at default and is similarly proportional at depth. So if you have the disk and want every last point of Polyglot, Q8_0 is a real upgrade. If you'd rather have the speed, Q4_K_M is the right call, the quality gap is small relative to the speed cost.

qwen3-coder-next 80B-A3B (Strix Halo UMA vs A30 hybrid offload)

The 80B-A3B that motivated the last post. At 45 GiB Q4_K_M it doesn't fit in 24 GiB VRAM, so the A30 column here is hybrid GPU/CPU offload (-ngl 22, 22 of 49 layers on GPU, the rest streamed from system RAM). Strix Halo's 128 GiB UMA swallows the full model and runs entirely on the iGPU:

Depth	Strix Halo pp	A30 hybrid pp	Strix Halo tg	A30 hybrid tg	A30/Strix Halo tg
0 (default)	551	110	56.4	12.2	0.22x ← Strix Halo wins 4.6x
8K	500	109	52.5	9.3	0.18x ← Strix Halo wins 5.6x
32K	372	109	46.9	5.4	0.12x ← Strix Halo wins 8.7x
65K	256	106	38.1	3.9	0.10x ← Strix Halo wins 9.8x

This is the clearest "wrong tool for the job" result I had. The A30 is a good card, it just doesn't have enough VRAM to hold the model, and PCIe bandwidth between GPU and host RAM is roughly 30x slower than the A30's own HBM2. So every token has to drag activations across that bottleneck.

The math: A30 hybrid tg falls from 12.2 to 3.9 tok/s (a 68% drop) over the depth sweep, while Strix Halo's UMA tg falls from 56.4 to 38.1 (only 32%). The A30 falls off twice as steeply because attention has to read the full KV cache to produce each new token, and on hybrid mode roughly half the model's layers, plus their slice of the KV cache, live in CPU RAM (DDR4, on this server). Each token's attention op pays PCIe-bandwidth overhead, and that overhead scales with context length. So 4.6× at default and 9.8× at 65K.

On the Strix system the story is the other way around: the iGPU has the same bandwidth to all 128 GiB as it does to the first 16 GiB. There's no VRAM cliff to fall off because there's no VRAM/RAM distinction at all. tg drops 32% from default to 65K (56.4 to 38.1 tok/s), painful but consistent, and at 38 tok/s with 65K of context loaded it's still... not fast, but usable.

(I also tried to run Aider Polyglot on A30 hybrid for a quality cross-check; the harness's per-call timeout repeatedly fired against the 3.9–9.3 tok/s hybrid response rate, and I abandoned the run after 9 of 225 exercises in ~5 hours. Throughput data above is from llama-bench directly, which doesn't have that problem.)

DeepSeek-Coder-V2-Lite, the bonus weird result

I benchmarked this one for completeness, expecting nothing exciting. Instead I found one of the clearest "the dedicated GPU is broken here" results in the whole sweep. DeepSeek-V2's Multi-head Latent Attention (MLA) uses a low-rank-projected KV cache that's smaller than standard MHA but requires a different attention kernel. The CUDA implementation in llama.cpp build 9064 falls off a cliff once any KV is present:

Depth	Strix Halo pp (Vulkan)	A30 pp (CUDA)	Strix Halo tg	A30 tg	A30/Strix Halo tg
0 (default)	1641	408	106.0	88.5	0.83x ← Strix Halo wins
8K	1032	17	64.4	4.9	0.08x ← Strix Halo wins 13x
32K	484	wedged	30.8	wedged	n/a
65K	250	wedged	17.4	wedged	n/a

The A30 bench actually wedged my harness, at d=32K, the CUDA kernel grinds at ~3-5 tok/s prefill, which means a single measurement of the 32K-token prefill would take 100+ minutes. I killed it after 17 minutes of no progress.
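The "100+ minutes" figure is just depth divided by prefill rate. A quick sketch, assuming "32K" means 32768 tokens:

```python
def prefill_minutes(n_tokens, tok_per_s):
    """Minutes needed to prefill n_tokens at a steady tok_per_s."""
    return n_tokens / tok_per_s / 60.0


depth = 32 * 1024  # assumption: "32K" means 32768 tokens

best_case = prefill_minutes(depth, 5.0)   # at the top of the ~3-5 tok/s range
worst_case = prefill_minutes(depth, 3.0)  # at the bottom of the range

print(f"32K prefill: {best_case:.0f} to {worst_case:.0f} minutes per measurement")
```

So even the optimistic end of the range is close to two hours for a single data point.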

Strix Halo's Vulkan path handles MLA at depth normally. Degrading from 106 to 17 tok/s tg is a real cliff, but it's a finite one and the bench actually finishes. Even at d=0 Strix Halo is 4× faster on pp512 (1641 vs 408), and that's before any KV is in play. The CUDA backend isn't just slow at depth on this architecture, it's just slow on this architecture.

This isn't a hardware issue, I don't think, it's a software bug. Presumably to be fixed in some future llama.cpp release lol. But for anyone considering DeepSeek-V2-family models for coding right now the miniPC is the only sensible target. A 24 GiB A30 will load the model just fine and then be fairly unusable.

Bonus models (Strix Halo coverage only)

For completeness, two more models I benchmarked on Strix Halo to fill out the table:

Model	Default pp	Default tg	65K pp	65K tg	Note
granite-3.1-8b-instruct	996	39.5	(crashed)	(crashed)	Vulkan device-lost at d=65K, got d=0/8K/32K only
llama-4-scout-17b-16e	159	20.1	105	13.9	17B-active, 109B-total, slowest in the post but flattest depth scaling (only 31% tg drop)

What each box is actually best for

Strix Halo as a coding agent: qwen3.6 with thinking on when I want quality, qwen3.6 with thinking off when I want speed/quality balance. Same model file, same throughput, just flip the --reasoning flag.
A30 for serving small concurrent requests: gpt-oss-20b at 130 tok/s or qwen3-30b-a3b at 136 tok/s is great for embeddings, rerank, and utility models in a stack.

These are different jobs. The boxes aren't substitutes; they're complements.

What I'd do differently

Update everything to the latest first. I spent a week chasing scores that looked too low only to realize my llama.cpp was 700 commits behind on reasoning-channel handling. Thinking models scored 0% on lm_eval because the reasoning content was consuming the entire context budget. A rebuild fixed it. This stuff moves fast, llama.cpp lands fixes weekly, so pull and rebuild to the latest before you trust a single number.
Bench with -d from the start, not -c. The -c arg got removed from llama-bench in recent builds; the replacement is -d for testing tg at a given KV depth. My first A30 long-context sweep died at parse time. Trivial fix in retrospect, but it cost me half a day.
Don't trust HumanEval+ as a discriminator. Everything competent scores 85%+. The bench doesn't separate "okay" from "great." Polyglot is what actually matters; I should have run it first.
Run whole and diff edit formats both. I ran everything in whole because it's robust for weak models. That makes the strong-model comparisons against Aider's leaderboard (which uses diff) slightly unfair to the local models. Doing both would have given a cleaner local-vs-API comparison.
Treat thermals and bench cleanliness as first-class concerns. Two specific traps cost me roughly a week of redo work:
- Don't re-make the same thermal mistakes as last time. I already worked this box's thermals out in the last post: sustained GPU load trips it unless you cap power with ryzenadj and pin the fans manually, because the stock fan curve is tuned for desktop bursts, not back-to-back benchmarks holding the GPU near 100% for minutes at a time. Then I forgot to actually turn any of that on before kicking off a multi-hour sweep, and crashed the box twice (no kernel log, just unreachable until a power-cycle) rediscovering a lesson I'd already written down. The fix was the one I already had on the shelf: mode=fixed level=5 on all three fans (under /sys/class/ec_su_axb35/fan*/) before any sustained workload. The wrapper now refuses to start a bench unless the fans are confirmed above 3500 RPM.
- Keep other model servers cleared out the whole time, not just at the start. Any concurrent llama-server process --mlock's its model into RAM and steals memory bandwidth from the bench. I caught this when a spot-check tg128 re-run came in 5% higher than the recorded number with everything else stopped. Five percent is small enough to miss in a single run and big enough to materially change rankings across models. The real trap is that it's easy to start clean and then let stray servers creep back in over a long session, so the fix isn't a one-time cleanup, it's re-verifying nothing else is loaded before every single run. Every Strix Halo throughput number in this post was collected that way, and the wrapper enforces it as a precondition.
The meta lesson: a bench harness that requires you to remember the discipline will eventually run dirty. Make the harness refuse to run unless the conditions are met.
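As an illustration of that "refuse to run" idea, here is a minimal sketch of the precondition check. It is not the actual wrapper: the function names are hypothetical, and how the fan RPMs and process names get collected is left out on purpose, because the exact sysfs attribute names under /sys/class/ec_su_axb35/ aren't given here. Only the 3500 RPM threshold and the "no other llama-server" rule come from the post.

```python
MIN_FAN_RPM = 3500  # threshold quoted in the post


def precondition_problems(fan_rpms, process_names):
    """Return the reasons a bench must not start (an empty list means go).

    fan_rpms: dict of fan name -> current RPM (collected elsewhere).
    process_names: iterable of running process names (collected elsewhere).
    """
    problems = []
    for name, rpm in sorted(fan_rpms.items()):
        if rpm <= MIN_FAN_RPM:
            problems.append(f"{name} at {rpm} RPM (need > {MIN_FAN_RPM})")
    strays = [p for p in process_names if p == "llama-server"]
    if strays:
        problems.append(f"{len(strays)} llama-server process(es) still running")
    return problems


def require_clean_box(fan_rpms, process_names):
    """Raise instead of benching when any precondition fails."""
    problems = precondition_problems(fan_rpms, process_names)
    if problems:
        raise RuntimeError("refusing to bench: " + "; ".join(problems))
```

Calling require_clean_box(...) before every single run, not just once at the start of a session, is what closes the "stray servers creep back in" hole.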

The end result

Use the right tool for the job. Shocking, I know.

The miniPC can be a Sonnet-tier coding agent (when running the right model) that costs about $2,500 once and never sends my code anywhere. The A30 box is for smaller task-specific models that need high throughput.

The local-vs-frontier gap is still real on the hardest problems and on real agentic Polyglot work, but the local model is roughly Sonnet-class for daily-driver coding tasks, and the gap is closing. The next time someone benchmarks this, I expect the frontier-API moat to be at least a little bit smaller.

*Footnote: the reasoning_effort parameter not wiring through to llama.cpp's gpt-oss path is documented elsewhere; I verified by running effort=low/medium/high through lm_eval gsm8k and getting near-identical scores (90% / 86% / 86%), within the sampling noise band for a 200-item subset. If the flag were actually doing anything, I'd expect monotonic improvement from low to high; instead "high" is the same as "medium" and "low" comes out higher than both, which only makes sense if all three are effectively the same configuration plus sampling noise. A separate post about this might be coming.*

Running a Frontier Coding Model on an Under-$3K Mini PC

Thu, 12 Mar 2026 12:00:00 GMT

TL;DR: I got Qwen3-Coder-Next (80B MoE) running at 46 tok/s on an under-$3K mini PC. It took a full OS reinstall, a firmware downgrade, kernel parameter archaeology, a thermal crisis, and throwing out about half the tuning advice I found online. Here's everything I learned the hard way.

Why This Hardware

My existing GPU setups didn't have enough VRAM to run some of the larger models I was interested in testing. Discrete GPUs with 48+ GB of VRAM are absurdly expensive, and splitting a model across multiple consumer cards comes with its own headaches and PCIe bottleneck tax. So I started looking into UMA (Unified Memory Architecture) systems — where the CPU and GPU share the same memory pool — as a significantly more affordable way to get a ton of usable memory for inference.

That led me to the Ryzen AI MAX+ 395. It's a weird chip: a laptop/mini-PC APU with 16 Zen 5 cores (32 threads), a 40-CU RDNA 3.5 iGPU, and support for up to 128 GB of LPDDR5X unified memory. Since the CPU and GPU share the same pool, the GPU can address all 128 GB without PCIe bottlenecks. For LLM inference, where model weights need to stream through the compute units every single token, that's a huge deal.

The theoretical memory bandwidth is 256 GB/s (LPDDR5X-8000 on a 256-bit bus). In practice I measured around 212-215 GB/s, or about 83-84% efficiency. That's slower than an M4 Max (~546 GB/s) but faster than trying to cram a 70B model across two consumer GPUs and eating the PCIe tax.
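The 256 GB/s figure falls straight out of the transfer rate and the bus width. A short worked version, using only the numbers quoted above:

```python
transfer_rate_mt_s = 8000   # LPDDR5X-8000: mega-transfers per second
bus_width_bits = 256        # total memory bus width

# bytes/s = transfers/s * bytes moved per transfer
peak_gb_s = transfer_rate_mt_s * 1e6 * bus_width_bits / 8 / 1e9
print(f"theoretical peak: {peak_gb_s:.0f} GB/s")

for measured_gb_s in (212, 215):
    efficiency = measured_gb_s / peak_gb_s
    print(f"{measured_gb_s} GB/s measured = {efficiency:.1%} of peak")
```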

The GMKtec NucBox EVO-X2 packages this chip into a mini PC chassis for under $3K with 128 GB RAM — though with the way LPDDR5 prices have been going lately, check current pricing before you get too excited. There are a few other options with this chip: Framework makes a Desktop, ASUS has the ROG Flow Z13 tablet, and Minisforum has the EliteMini AI Max. The GMKtec was the best price-to-performance option I found at the time, but it's worth shopping around.

The OS: Rocky Linux 9.7

I'm running Rocky Linux 9.7 — enterprise stability, good package ecosystem, SELinux actually works properly. Any RHEL 9 derivative should work similarly.

The Three Things That Must Be Right

After the base OS was clean, I hit a wall. A really frustrating wall. Getting this hardware working properly requires three specific things to be correct — the right kernel, the right firmware, and thermal power limits that won't let the system cook itself to death. I'm going to cover all three here because skipping any one of them will ruin your day.

1. Kernel 6.18.4 or newer

The KFD (Kernel Fusion Driver) in older kernels has a page table bug specific to gfx1151. Any GPU tensor allocation triggers "Memory access fault: Page not present" errors. This was fixed upstream in kernel 6.18.4. Rocky 9's stock kernel is 5.14, which is too old.

I tried AMD's amdgpu-dkms package first (which backports the amdgpu driver to older kernels), but the DKMS version is pre-6.18 and doesn't include the KFD fix. No combination of kernel parameters and environment variables (HSA_ENABLE_SDMA=0, amd_iommu=off, amdgpu.noretry=0, amdgpu.cwsr_enable=0) works around it. Trust me, I tried them all. You need the actual kernel fix.

The solution: ELRepo's kernel-ml package, which provides mainline kernels packaged for RHEL/Rocky. I installed 6.19.6 and it just worked.

sudo dnf install -y elrepo-release
sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo dnf --enablerepo=elrepo-kernel install -y kernel-ml

2. MES firmware version 0x80

Even with kernel 6.19.6, I was still getting page faults. Cool. The second half of the puzzle is the MES (Micro Engine Scheduler) firmware. Rocky's linux-firmware-20260130 package ships MES version 0x83, which is known to cause ROCm page faults on Strix Halo. The upstream linux-firmware repository explicitly reverted it with the commit message: "MES FW 0x83 is reported to cause ROCm page faults."

Rocky hadn't picked up the revert yet, and AMD's own amdgpu-dkms-firmware package also ships 0x83. So the fix is manual:

# Download good firmware (version 0x80) from upstream revert commit
curl -sL -o /tmp/gc_11_5_1_mes1.bin \
  "https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/amdgpu/gc_11_5_1_mes1.bin?id=c092c7487eb7c3d58697f490ff605bc38f4cc947"
curl -sL -o /tmp/gc_11_5_1_mes_2.bin \
  "https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/amdgpu/gc_11_5_1_mes_2.bin?id=c092c7487eb7c3d58697f490ff605bc38f4cc947"

# Install to updates dir (takes priority over base firmware)
sudo mkdir -p /lib/firmware/updates/amdgpu
sudo cp /tmp/gc_11_5_1_mes1.bin /lib/firmware/updates/amdgpu/
sudo cp /tmp/gc_11_5_1_mes_2.bin /lib/firmware/updates/amdgpu/

# Rebuild initramfs and reboot
sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
sudo reboot

Verify after reboot:

sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MES
# Good: firmware version: 0x00000080
# Bad:  firmware version: 0x00000083

Once both pieces were in place, PyTorch passed all validation checks: tensor operations, all data types (fp32, fp16, bf16, int8), 4 GiB memory allocation, and ~1.05 TFLOPS on a 4096x4096 FP32 matmul. Finally.

Lesson learned the hard way: Pin your firmware. I added exclude=linux-firmware* amdgpu-dkms-firmware* to /etc/dnf/dnf.conf to prevent package updates from sneaking MES 0x83 back in. Ask me how I know.

3. Thermal Power Limits

This one might be the most important of the three, so don't skip it.

While setting up a PyTorch benchmarking suite, the system started dying on me. At first I figured "oh weird, the host crashed" — but when I went to check on it, it wasn't just locked up. It was fully powered off. That's... not normal. Then it happened again. And again. Full hard power-off events with no warning, no logs, nothing.

I set up thermal monitoring logging every 5 seconds and caught the cause:

19:00:07  Tctl=71°C   pwr=92W    ← normal inference
19:00:12  Tctl=91°C   pwr=165W   ← torch.compile spike
19:00:22  Tctl=93°C   pwr=164W   ← approaching TjMax (100°C)
19:00:27  Tctl=61°C   pwr=30W    ← thermal shutdown

torch.compile triggers Triton/Inductor kernel compilation that simultaneously hammers all 32 CPU threads and the GPU. On a UMA APU where everything shares one thermal envelope in a mini PC chassis, that produces a 165W power spike, way past the 120W PPT Fast limit and far more than the little cooler can handle. The firmware thermal protection kicks in and just kills power. No graceful shutdown, just off.

Normal LLM inference is totally fine — 73-75W, 76-80°C, perfectly stable all day long. But the moment you hit a mixed CPU+GPU burst workload, you're rolling the dice. And it's not just torch.compile — anything that pegs the CPU and GPU simultaneously in this chassis can trigger it. I lost count of how many times the system just cut out on me before I got this sorted.

The fix is ryzenadj, a tool that lets you adjust AMD mobile power limits from Linux:

sudo ryzenadj --fast-limit=100000 --tctl-temp=88

This caps burst power to 100W and sets the thermal target to 88°C, giving 12°C of headroom before TjMax. Do this immediately after your first boot, before you run anything heavy. I created a systemd service to persist these limits across reboots so they're always active. The GMKtec ships with BIOS 1.12 / EC 1.10 (the latest available), so there's no firmware fix coming — you've gotta manage this in software.
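To see what those limits would have caught, here is a minimal sketch that scans a monitoring log for samples above them. The line format is copied from the log excerpt earlier in this section; the function name is hypothetical, and comparing a 5-second sample against the burst limit is a simplification, not how ryzenadj itself enforces anything.

```python
import re

# Matches lines like: "19:00:12  Tctl=91°C   pwr=165W"
SAMPLE = re.compile(r"(\d{2}:\d{2}:\d{2})\s+Tctl=(\d+)°C\s+pwr=(\d+)W")


def over_limit(log_text, max_temp_c=88, max_power_w=100):
    """Return (time, temp_c, power_w) for every sample above either limit."""
    hits = []
    for line in log_text.splitlines():
        match = SAMPLE.search(line)
        if not match:
            continue  # skip anything that isn't a sample line
        stamp = match.group(1)
        temp_c = int(match.group(2))
        power_w = int(match.group(3))
        if temp_c > max_temp_c or power_w > max_power_w:
            hits.append((stamp, temp_c, power_w))
    return hits


log = """\
19:00:07  Tctl=71°C   pwr=92W
19:00:12  Tctl=91°C   pwr=165W
19:00:22  Tctl=93°C   pwr=164W
19:00:27  Tctl=61°C   pwr=30W
"""

flagged = over_limit(log)
for stamp, temp_c, power_w in flagged:
    print(f"{stamp}: {temp_c}°C, {power_w}W exceeds the 88°C / 100W caps")
```

Both torch.compile samples get flagged; the normal-inference sample and the post-shutdown sample do not.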

Other thermal improvements people recommend but I haven't tried yet: replacing the stock thermal paste with PTM7950 phase-change material, and the ec_su_axb35 kernel module for Linux fan control. Maybe I'll get to those at some point.

Understanding Unified Memory (It's Unintuitive)

The BIOS has a "UMA Frame Buffer Size" setting that defaults to 64 GB. Your instinct says "big number = more GPU memory = good." Yeah, your instinct is wrong here.

On a traditional discrete GPU, VRAM is physically separate from system RAM. On Strix Halo, there's only one pool of LPDDR5. The BIOS carveout reserves a chunk of that pool as dedicated VRAM — the OS can't see it, can't use it for anything else, and the GPU doesn't even need it because it can access system RAM at the same speed through GTT (Graphics Translation Table).

The optimal configuration is:

BIOS VRAM: 2 GB (the minimum on the GMKtec's current BIOS 1.12 — you'll see guides online saying to set this to 512 MB, but that was only possible on earlier BIOS versions. 2 GB is as low as it goes now.)
GTT: 124 GB (dynamically mapped, shared between CPU and GPU)

This gives you ~124 GB usable for both CPU and GPU workloads, instead of 64 GB locked to GPU + 64 GB for CPU.

The kernel parameters to make this work:

amdgpu.gttsize=126976          # 124 GiB GTT
ttm.pages_limit=29360128       # Allow TTM to manage 112 GiB of pages
ttm.page_pool_size=29360128    # Matching pool size
amdgpu.no_system_mem_limit=1   # Disable SVM resident memory cap
amd_iommu=off                  # Fully disable IOMMU (~4% bandwidth gain)

The ttm.pages_limit parameter is particularly sneaky. Without it, you can set GTT to 124 GB and the kernel will report 124 GB, but HIP/ROCm applications will only see ~62 GiB. The TTM subsystem has its own page limit that must match. And it has to be set at boot — runtime changes don't take effect. That one took a while to figure out.
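The two parameters use different units, which is easy to trip over: amdgpu.gttsize is in MiB, while the TTM limits are page counts. A small conversion sketch, assuming the usual 4 KiB page size (the helper names are mine):

```python
PAGE_SIZE = 4096     # bytes; assumption: standard 4 KiB pages
GIB = 1024 ** 3


def gttsize_mib(gib):
    """Value for amdgpu.gttsize, which is expressed in MiB."""
    return gib * 1024


def ttm_pages(gib):
    """Value for ttm.pages_limit / ttm.page_pool_size, expressed in pages."""
    return gib * GIB // PAGE_SIZE


print(gttsize_mib(124))  # the gttsize value used above
print(ttm_pages(112))    # the pages_limit value used above
print(ttm_pages(124))    # what a 124 GiB page limit would be
```

By this arithmetic the gttsize above corresponds to 124 GiB and the TTM values to 112 GiB; a page limit that matches a 124 GiB GTT would be 32505856. Whether the larger value is worth it on a given box is something to test rather than assume.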

On Rocky 9, updating kernel parameters has its own gotcha: editing /etc/default/grub and running grub2-mkconfig doesn't work. Rocky 9 uses BLS (Boot Loader Specification) entries, which have their own options line. Use grubby instead:

sudo grubby --update-kernel=DEFAULT --args="amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=29360128 ttm.page_pool_size=29360128 amdgpu.no_system_mem_limit=1"

Building and Running llama.cpp

Ok, with the hardware finally cooperating, I built llama.cpp. I started with ROCm/HIP since that's what everyone recommends for AMD GPUs:

cmake -B build \
  -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_NATIVE=OFF -DCMAKE_C_FLAGS='-march=znver4' -DCMAKE_CXX_FLAGS='-march=znver4' \
  -DGGML_AVX512=ON -DGGML_AVX512_VBMI=ON -DGGML_AVX512_VNNI=ON -DGGML_AVX512_BF16=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_LTO=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

A few build notes:

-DGGML_NATIVE=OFF with explicit -march=znver4 is required because GCC 11 on Rocky 9 emits VNNI instructions that the system's binutils can't assemble. Specifying znver4 explicitly avoids the problematic auto-detection.
The AVX512 flags enable SIMD for CPU-side tensor ops. Zen 5 has full AVX-512 support.
GGML_HIP_ROCWMMA_FATTN enables wave matrix multiply for flash attention.

Critical for APUs: You must set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 before running. Without it, llama.cpp tries to allocate in the 2 GB dedicated VRAM carveout and fails for any model larger than 2 GB. With it, allocations go through the full GTT pool. Don't skip this or you'll be very confused.

The Model

I'm running Qwen3-Coder-Next Q4_K_M — an 80B parameter Mixture-of-Experts model with 3B active parameters, purpose-built for coding agents. At Q4_K_M quantization it's about 46 GiB across 4 GGUF shards, fitting comfortably in 128 GB with room for a 65K token context window.

The Mixture-of-Experts architecture is what makes this hardware viable. An 80B MoE model only needs to stream the active expert weights each token — roughly 3B parameters — not the full 80B. Dense 70B models? They crawl at 5-7 tok/s on this hardware. This 80B MoE? 46 tok/s. Same memory, same bandwidth — the model architecture makes all the difference.
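A back-of-the-envelope way to see why: if token generation is bandwidth-bound, the ceiling on tok/s is memory bandwidth divided by the bytes of weights read per token. The sketch below uses the ~215 GB/s measured earlier; the 4.8 bits per weight is my assumed average for a Q4_K_M-style quant, not a number from this post, and the model ignores KV cache reads, attention, and any always-active shared layers.

```python
def tg_ceiling(bandwidth_gb_s, active_params_b, bits_per_weight):
    """Upper bound on tok/s if each token streams every active weight once.

    active_params_b is in billions of parameters, so the product below is in GB.
    """
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


BANDWIDTH_GB_S = 215.0   # measured figure from earlier in this post
BITS_PER_WEIGHT = 4.8    # assumption: rough average for a Q4_K_M-style quant

dense_70b = tg_ceiling(BANDWIDTH_GB_S, 70, BITS_PER_WEIGHT)
moe_a3b = tg_ceiling(BANDWIDTH_GB_S, 3, BITS_PER_WEIGHT)

print(f"dense 70B ceiling: {dense_70b:.1f} tok/s")
print(f"3B-active MoE ceiling: {moe_a3b:.1f} tok/s")
```

The dense estimate lands right around the 5-7 tok/s range, and the MoE ceiling sits well above the 46 tok/s actually measured, which is what you'd expect from a bound that leaves out everything except weight streaming.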

This model scored #1 on SWE-rebench Pass@5 at 64.6%, beating Claude Opus 4.6 (58.3%). Running it locally at interactive speeds on a sub-$3K box (give or take, depending on what RAM prices are doing this week) is... pretty nuts.

Runtime Configuration

I run llama-server as a systemd service with these flags:

-fa on               # Flash attention (less attention memory, faster attention; needed for a quantized V cache)
--parallel 1         # Single slot, all memory for one user
-t 32 -tb 32         # All 32 CPU threads
-ub 2048             # Large ubatch for GPU utilization during prompt processing
-ctk q8_0 -ctv q8_0  # Quantized KV cache (~2x smaller than f16, minimal quality loss)
--mlock              # Pin model in RAM
-c 65536             # 65K context window

Two things I learned about GPU power modes: profile_peak sounds good but actually causes thermal throttling on an integrated GPU sharing the SoC thermal envelope. Generation dropped from 37.9 to 26.9 tok/s. Ouch. Use high instead — it clocks up aggressively but lets the thermal controller do its job.

Tuning: What the Internet Got Wrong

With the system stable, I went through every tuning recommendation I could find — a comprehensive "definitive guide" document and the strixhalo.wiki llama.cpp performance page. I benchmarked each claim individually. A lot of them were wrong, at least for this hardware.

Things that didn't matter

--no-mmap vs --mlock: Identical performance. pp=219.5/tg=37.7 vs pp=218.7/tg=38.0. On a UMA APU where GPU memory is system memory, both approaches effectively do the same thing. Pick whichever you prefer.

-b 256 batch size: Slightly worse than the -ub 2048 I was already running. The claimed jump from 70 to 591 tok/s was for Qwen3-30B-A3B, a much smaller model with different memory access patterns. Don't copy batch size settings across models.

ROCBLAS_USE_HIPBLASLT=1: No measurable effect on gfx1151 with this model. The "mandatory" claim may apply to other GPU architectures.

Things that helped a little

amd_iommu=off: Real. Generation speed went from 38.0 to 39.4 tok/s — a 3.7% improvement. Not the claimed 6%, but free performance. I also bumped GTT from 112 GiB to 124 GiB in the same change.

The big discovery: Vulkan beats ROCm

Then I built llama.cpp with Vulkan instead of HIP, just to see what would happen:

cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

The results were... not subtle:

Context	Vulkan pp (tok/s)	Vulkan tg (tok/s)	HIP pp (tok/s)	HIP tg (tok/s)
Default (512)	548	45.9	336	40.8
32K	394	36.8	91	29.7
65K	305	32.2	54	23.5
100K	213	28.2	36	18.7

Vulkan with RADV (Mesa's open-source Vulkan driver) was 63% faster at prompt processing and 12% faster at generation at default context. The gap widens with context length — at 100K tokens, Vulkan is nearly 6x faster at prompt processing and 51% faster at generation.

This directly contradicts the common advice that "ROCm is better for long-context work." That may be true on datacenter GPUs (MI300X) or older desktop GPUs (gfx1100), but on gfx1151, the HIP compute kernels are known to run 2-6x slower than expected. Vulkan's cooperative matrix support through RADV doesn't have the same problem.

The guides also recommended AMDVLK (AMD's own open-source Vulkan driver) over RADV for 10-15% better performance. I investigated and found that AMD discontinued AMDVLK in September 2025, going all-in on RADV. The strixhalo.wiki's own benchmarks actually show RADV beating AMDVLK even before AMD killed it. Just use RADV.

One nice bonus: the Vulkan build doesn't need the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable. That's a HIP/ROCm-specific workaround.

The Boring But Important Stuff

A handful of other things that aren't exciting but tripped me up:

DNF firmware pinning: Added exclude=linux-firmware* amdgpu-dkms-firmware* to /etc/dnf/dnf.conf. Without this, a routine dnf update can reintroduce MES 0x83 and break GPU compute.

EPEL rocminfo conflict: EPEL ships rocminfo 5.4.4 which conflicts with the ROCm 7.2 version from AMD's repo. Fixed with dnf config-manager --save --setopt=epel.excludepkgs=rocminfo.

SELinux and systemd: The llama-server binary must live in /usr/local/bin (not ~/) for SELinux to allow systemd to execute it. Run restorecon -v after copying.

WiFi: The MediaTek MT7925 (Wi-Fi 7) works with WPA2 networks but fails on WPA2/WPA3 mixed-mode SSIDs. Suspected mt7925e driver bug. If your router broadcasts both, you may need a WPA2-only SSID.

GPU performance mode: Set via udev rule to persist across reboots:

echo 'ACTION=="add", SUBSYSTEM=="drm", KERNEL=="card0", ATTR{device/power_dpm_force_performance_level}="high"' \
  | sudo tee /etc/udev/rules.d/99-gpu-perf.rules

What I'd Do Differently

If I were setting this up again from scratch:

Start with Vulkan, not ROCm/HIP. I spent way too much time optimizing the HIP build before discovering Vulkan was faster at everything. Just build llama.cpp with -DGGML_VULKAN=ON from the start.
Install ELRepo kernel immediately. Don't waste time trying to make the stock 5.14 kernel work with DKMS. It can't. I tried.
Check MES firmware before debugging anything else. If rocminfo hangs or GPU compute produces page faults, check MES version first. It's the most common cause and the least obvious one.
Set BIOS VRAM to minimum and maximize GTT from day one. The default 64 GB carveout wastes half your memory for no reason.
Install ryzenadj before you do literally anything else. Seriously. The thermal shutdowns caught me completely off guard and happened repeatedly. The stock power limits on this chassis are not safe for sustained workloads. Cap power first, then start playing with models.

The End Result

My final configuration:

Component	Setting
OS	Rocky Linux 9.7, kernel 6.19.6 (ELRepo)
GPU driver	Mesa RADV 25.0.7 (Vulkan)
MES firmware	0x80 (manually installed)
BIOS VRAM	2 GB (minimum)
GTT	124 GiB
IOMMU	Fully disabled
Power limits	100W burst / 88°C target (ryzenadj)
llama.cpp	Vulkan build, flash attention, q8_0 KV cache
Model	Qwen3-Coder-Next Q4_K_M (80B MoE, 46 GiB)
Context	65K tokens

Performance:

Metric	Speed
Token generation (short context)	45.9 tok/s
Token generation (32K context)	36.8 tok/s
Token generation (65K context)	32.2 tok/s
Token generation (100K context)	28.2 tok/s
Prompt processing (short context)	548 tok/s

For a mini PC that cost me under $3K — though good luck getting that price if LPDDR5 keeps doing what it's been doing — running a frontier-class 80B coding model entirely locally, with 65K context and no API costs? I'm pretty happy with that.

Tested on: GMKtec NucBox EVO-X2, AMD Ryzen AI MAX+ 395, 128 GB LPDDR5, Rocky Linux 9.7, kernel 6.19.6, llama.cpp build f90bd1dd8, Mesa RADV 25.0.7. March 2026.

Rescuing "Unsupported" Enterprise SSDs with Custom MegaRAID Tools

Sat, 31 Jan 2026 09:00:00 GMT

I picked up a pair of refurbished Samsung PM1643a SSDs (3.84TB each) for my Dell R740 Proxmox server. Great deal on enterprise drives, right? Except when I installed them, my PERC H330 controller showed them as "UGUnsp" (Unconfigured Good Unsupported) with a size of... 0 KB.

The drives had been pulled from an enterprise storage array (likely Hitachi or EMC) and were formatted with 520-byte sectors instead of the standard 512. Those extra 8 bytes per sector are used for T10-DIF data integrity protection – useful in big SANs, useless for my homelab.

The usual fixes wouldn't work. sg_format? Can't see the drives. Samsung DC Toolkit? Nope. Perccli format/erase commands? "Operation not allowed." The controller refused to expose them to Linux at all – no /dev/sd*, no /dev/sg*. My only option seemed to be flashing the H330 to IT-mode, which meant unacceptable downtime.

Then I noticed something: smartctl -d megaraid,4 -i /dev/sda could actually talk to the drives via MegaRAID passthrough. The controller wouldn't expose them, but it would relay SCSI commands to them. That was my way in.

With some help from Claude Code, I dug into smartctl's source code and reverse-engineered the MegaRAID IOCTL interface. The result is a set of small C tools that send SCSI FORMAT UNIT and MODE SELECT commands directly through the MegaRAID passthrough – no HBA flash required, no downtime.

Both drives are now happily running at 512-byte sectors, showing their full 3.49 TiB each, and working perfectly as JBOD in Proxmox.

I've open-sourced the tools in case anyone else runs into this: github.com/filthyrake/megaraid_format_tools

Vlog??

Wed, 03 Dec 2025 21:07:35 GMT

Once upon a time I had a YouTube channel. I mean I still do, but most of my videos are now private and I stopped posting. I don’t really want to reactivate my channel and be a YouTuber again, but I also don’t want to just toss all that stuff out, so I’m standing up vlog.damenknight.com and migrating MOST of my old YouTube content over.

Now I DO want to create at least some content still – I clearly enjoy it – but doing it here instead of on YT will hopefully let me keep it a bit more chill and maybe more consistent. Head on over, check it out as I get things migrated, and keep your eyes out in the future for all NEW content! More car stuff, more astro stuff, more tech, you name it!

Astrophotography Datasets Site

Mon, 05 May 2025 00:13:28 GMT

For a while now, I’ve been sharing my astrophotography datasets on a really basic mid-’90s-looking site I threw together. It was ugly, it was hard to use, it sucked. But it served its purpose and I was ok with it. I had bigger plans though.

You see, there aren’t a ton of sites that share astrophotography datasets – especially not for free. Many people sell theirs, and the best free options have historically been places like NASA and the ESA. I absolutely love those free resources, but I wanted something for the rest of us and with more variety.

So I’ve spent the past… many many many months working hard to build something new. This was a pain, since I am not a web guy or a software guy and this involved both – and I’m far far from done still – but it is in a place where I’m happy to talk about it and start showing it off.

If you haven’t yet, head over to check it out: Miscellaneous Datasets. It is fairly filterable by what kind of data you want. All datasets can be downloaded in all the major formats. Right now it is limited to data I’ve personally captured, but I’m hard at work getting more contributors on board. My dream is that someday this will be the largest free non-government dataset resource in the world.

Warewulf Home Lab Setup

Tue, 25 Mar 2025 19:13:18 GMT

Control Node

Obligatory HomeLab Writeup

Sat, 15 Mar 2025 16:41:12 GMT

Don’t judge my local fire-hazard.

I’ll get this out of the way first: You do not want rack servers in your home. They’re *really* loud. I am just a crazy person.

Ok, now that that’s out of the way, let’s talk about what we’ve got. Starting with the items actually installed in the rack from the bottom and working our way up:

Synology RS2421+

Mounting my miniPC and Powerbox

Mon, 29 Jan 2024 14:43:16 GMT

I recently got a Pegasus Pocket PowerBox Advanced for my astro setup, and while I was at it I moved my miniPC up onto the telescope instead of leaving it on my mount.

I did this using the accessories from BuckEye Stargazer. Unfortunately, there aren’t a ton of guides or instructions for how everything goes together, so I went ahead and filmed a quick video going over the process.

Achieving better telescope balance

Thu, 25 Jan 2024 17:25:34 GMT

On the astrophotography Discord server I hang out on I’ve seen lots and lots of folks struggle with balancing their telescope. I think a big part of the reason why this is so difficult is that a lot of the descriptions of what good balance is are vague or non-existent. So I went ahead and made a quick video to help show what “good” balance looks like.

Network Setup

Mon, 22 Jan 2024 18:39:44 GMT

WiFi: NetGear Orbi RBRE960 (AP mode, 3AP Mesh)
Router: Custom MiniPC pfSense Router (10GBit LAN, 2.5GBit WAN)