Strix Halo vs an A30 vs the Frontier: What the miniPC Can (and Can't) Actually Do

A few months ago I wrote about getting a Strix Halo miniPC (~$2.5K all-in) to run a frontier coding model. That led to people asking me how I was measuring/testing models (I really wasn't, beyond what my anecdotal experience was). That led to me starting to actually benchmark stuff - but then I thought... why not throw my A30 GPU in the mix while I'm at it, and really see what's what? They even cost me (roughly) the same.

The short version: I didn't lie! It is roughly like Sonnet when it's a bit slow. BUT, things have changed and now you should use qwen3.6. Also there are some other neat findings ahead ;)

The longer version: quality is hardware-agnostic, which is exactly what you'd expect (run the same model on a Strix Halo APU or a datacenter A30 and you get the same benchmark score within noise, because the hardware should decide how fast, not how smart). The interesting questions were always about speed and fit, and that is where the miniPC shines!

Locally-hosted qwen3.6-thinking (a 35B-A3B Mixture-of-Experts (MoE) model, that is, 35B total parameters with only ~3B activated per token) scores 62.2% on the Aider Polyglot benchmark, sitting right between Claude Sonnet 4 thinking (61.3%) and Claude 3.7 Sonnet thinking (64.9%), and it does that at 45 tok/s with a 65K-token context. That is Sonnet-class capability, at usable speeds, on a ~$2.5K box. It is not Opus-class (72.0%) or GPT-5-class (88.0%), but Sonnet-class coding on hardware I own, at no per-token cost and with nothing leaving my network.

The surprises were both about memory, not compute. First: even when a model fits comfortably in the A30's VRAM, the A30 can still lose at long context. The A30 wins at short context, but on qwen3-30b-a3b at 65K depth the miniPC outruns it by 36% (28.1 vs 20.6 tok/s), because the A30's 24 GiB runs out of room for the KV cache (the attention key/value store that grows with every token of context) while the miniPC's 128 GiB of unified memory shrugs it off (the crossover lands somewhere between 8K and 32K). qwen3-30b-a3b is only 17 GiB, so this isn't even an offload problem: it's all running on the GPU, and the A30 just can't hold a big model and a big KV cache at once.

The bigger surprise was the hybrid cliff. For any model that doesn't fit in 24 GiB at all, the miniPC wins outright. qwen3-coder-next (80B-A3B) has to be split across GPU and CPU on the A30, and once half its layers (plus their KV) live in slow DDR4 system memory and get streamed over PCIe every token, it crawls: 3.9 tok/s at 65K against Strix Halo's 38, almost 10x. That gap, a unified-memory box versus a GPU forced into hybrid offload, is the whole reason I bought the thing.

The recommendation has also moved since the last post. Back then, qwen3-coder-next was the best local-fitting coding model I had, and I called its quality "something like Sonnet 4.5 on a slow day." That was ... close enough that I won't say I was wrong, hah (it lands about 9pp below baseline Sonnet 4), but qwen3.6 has since been released, and on the same Strix Halo hardware it is straightforwardly better: higher quality, faster, and a smaller VRAM footprint. If you have been running qwen3-coder-next since my last post, switch to qwen3.6.

Contents

The contestants
The benchmarks
Quality: Strix Halo vs A30
Quality: Local vs the Frontier
Coding-specific: Aider Polyglot
Saturated benchmarks: HumanEval+ and lm_eval
Quality at depth: NIAH
Speed: default throughput
Speed at long context
gpt-oss-20b (fits both)
qwen3-30b-a3b-2507 (fits both)
qwen3.6-thinking (Strix Halo only, Sonnet-tier model)
qwen3-coder-next 80B-A3B (Strix Halo UMA vs A30 hybrid offload)
DeepSeek-Coder-V2-Lite, the bonus weird result
Bonus models (Strix Halo coverage only)
What each box is actually best for
What I'd do differently
The end result

The contestants

Spec	Strix Halo (the miniPC)	A30
GPU/APU	AMD Ryzen AI MAX+ 395 (gfx1151)	NVIDIA A30 PCIe
Architecture	RDNA3.5-class iGPU, unified memory (UMA)	Ampere GA100, dedicated PCIe
Memory	128 GiB LPDDR5 (shared CPU+GPU)	24 GiB HBM2
Backend	llama.cpp Vulkan (RADV)	llama.cpp CUDA
TDP (effective)	100W (after `ryzenadj`)	165W
Host system	GMKtec NucBox EVO-X2	R740 / Proxmox VM (16c/64G)
Approx all-in cost	~$2,500	~$2-3K used (card only; bring your own server)
Largest model that fits	gpt-oss-120B (~64 GiB), qwen3-coder-next 80B-A3B (~45 GiB)	qwen2.5-coder-32B (Q4_K_M) at most

The "what fits" row is the most important and most under-discussed difference, with one clarification: it means fits entirely in VRAM, at full speed. You can push a bigger model onto the A30 with hybrid GPU/CPU offload (I do exactly that later in the post), it just runs much slower once part of it spills out of the 24 GiB. The miniPC's unified memory has no such cliff: anything up to 128 GiB loads and runs on the iGPU at full speed, including a 120B model the A30 can't hold in VRAM at all.

It's worth noting what's doing the spilling on the A30 side: its host is an older Dell R740 on DDR4, so when a model overflows into system RAM, DDR4 bandwidth is the bottleneck. A newer DDR5 host with more memory channels would lift the hybrid numbers, but it would also cost more, which is sort of the point: the unified-memory miniPC sidesteps the spill entirely, for the price of a single mid-range box.

The benchmarks

Four things, each measuring something different:

Aider Polyglot: 225 multi-language Exercism coding exercises where the model is asked to edit existing files to make tests pass. This is the only benchmark on the list that resembles real-world agentic coding work, and it's the one frontier models actually struggle with. Not saturated.
HumanEval+: function-level code generation, 164 problems. Top models all score 90%+. Saturated.
lm_eval (gsm8k, ifeval, mmlu_pro): knowledge and instruction-following at single-prompt level. Frontier models saturate this too.
llama-bench: pure throughput, no quality signal. Two numbers matter, both in tokens/sec: pp (prompt processing, how fast the model ingests your prompt) and tg (token generation, how fast it writes the reply). I report the defaults pp512 / tg128 (a 512-token prompt, 128 generated tokens) plus depth tests for long-context behavior.

I treat Polyglot as the load-bearing quality metric because (a) it actually discriminates, and (b) it's what I care about: agentic coding is what these boxes get used for in practice. If there's a benchmark I didn't run, it is because I don't know about it or didn't think of it.

One reading convention for every table below: higher is better, unless I explicitly say otherwise.

Quality: Strix Halo vs A30

Nobody expects a model to get smarter or dumber depending on whether it runs on AMD or NVIDIA silicon, and it doesn't. This was never really an open question. But I had both boxes and was running the benchmarks anyway, so I figured I'd confirm it and see where any real differences showed up. The answer: quality is the same within noise, with one model-specific surprise. Boring but necessary setup for everything that follows.

Model	Strix Halo (Vulkan)	A30 (CUDA)
gemma-3-27b-it (HumanEval+ p@1+)	78.7%	77.4%
qwen2.5-coder-32b (HumanEval+ p@1+)	85.4%	86.6%
qwen3-30b-a3b-2507 (HumanEval+ p@1+)	89.0%	89.6%
qwen2.5-coder-32b (Polyglot)	25/225 (11.1%)	25/225 (11.1%)
qwen3.6 (Polyglot, no think)	121/225 (53.8%)	106/225 (47.1%)

("p@1+" is pass@1 on EvalPlus's extended test set, meaning the model's first answer has to pass every test.)

Within noise on HumanEval+ across the board, and on the qwen2.5-coder Polyglot row. The qwen3.6 Polyglot row shows a 6.7pp cross-host gap (53.8% Strix Halo vs 47.1% A30), which is larger than I'd expect from pure sampling noise; possibly a real CUDA-vs-Vulkan difference for that specific model and harness, or a build-version skew between the two boxes. The HumanEval+ gemma/qwen2.5/qwen3-30b rows on the same model files agree to within about 1.3pp cross-host (with the lead going both ways), and the qwen2.5-coder Polyglot row agrees exactly, so it isn't a general "the A30 produces worse logits" pattern; it's a qwen3.6-Polyglot-specific finding I'd want to dig into in a future bench.
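
One rough way to put a number on "sampling noise" for a gap like this is a two-proportion z-score. The sketch below is my own illustration, not part of the bench harness; the function name is made up, and it treats the two runs as independent samples, which is an assumption (both hosts ran the same 225 exercises, so a paired per-exercise comparison would be more sensitive).

```python
from math import sqrt


def two_proportion_z(pass_a: int, pass_b: int, n: int) -> float:
    """z-score for the gap between two pass counts out of n exercises each.

    Uses the pooled-proportion standard error and assumes the two runs
    are independent samples (an assumption, see the note above).
    """
    p_a, p_b = pass_a / n, pass_b / n
    pooled = (pass_a + pass_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    return (p_a - p_b) / se


# qwen3.6 Polyglot, no think: 121/225 on Strix Halo vs 106/225 on the A30
z = two_proportion_z(121, 106, 225)
print(f"gap = {(121 - 106) / 225:.1%}, z = {z:.2f}")
```

Under that independence assumption the gap comes out around 1.4 standard errors, so on this crude test it reads as suggestive rather than conclusive. Comparing which specific exercises flipped between hosts would be the sharper follow-up.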

So the model is mostly the model, and hardware doesn't make it dumber in any general sense. There can be model-specific cross-host quirks worth checking (this one came as a surprise to me), but for the typical case, once you've picked a model that fits, the hardware question reduces to how fast and can it even fit.

Quality: Local vs the Frontier

Here's where it gets fun. I'm going to split this into coding and non-coding because they behave very differently.

Coding-specific: Aider Polyglot

Polyglot is the benchmark where frontier models still have headroom, and the one that tracks "how good is this thing as a coding agent." Here's the comparison (Aider leaderboard scores for the API models, my results for local):

Model	Polyglot pass rate	Notes
GPT-5 (high)	88.0%	API
Gemini-2.5-Pro (32k think)	83.1%	API
DeepSeek-V3.2 Reasoner	74.2%	API (open weight ~700B, won't fit my hardware)
Claude Opus 4 (32k think)	72.0%	API
Claude Opus 4 (no think)	70.7%	API
Claude 3.7 Sonnet (32k think)	64.9%	API
qwen3.6-thinking (Strix Halo)	62.2%	local, 35B-A3B MoE
Claude Sonnet 4 (32k think)	61.3%	API
Claude 3.7 Sonnet (no think)	60.4%	API
Claude Sonnet 4 (no think)	56.4%	API
qwen3.6 (Strix Halo, no think)	53.8%	local
qwen3-coder-next (Strix Halo)	47.6%	local, 80B-A3B MoE; doesn't fit on A30, see the Speed sections
qwen3.6 (A30, no think)	47.1%	local
GPT-OSS-120B (high)	41.8%	leaderboard score, API
qwen3-30b-a3b-2507 (Strix Halo)	30.2%	local
qwen3-30b-a3b-2507 (A30)	28.9%	local
GPT-OSS-20B-thinking (A30)	16.9%	local
GPT-OSS-120B (Strix Halo, Q4_K_M)	1.8%*	local, almost certainly broken locally, see footnote

* The 23× gap between local gpt-oss-120B (1.8%) and the same model's API leaderboard score (41.8%) is almost certainly the reasoning_effort parameter not wiring through to llama.cpp's gpt-oss path: low/medium/high produce near-identical outputs within sampling noise. For a model whose top-line capability is its reasoning depth, a broken reasoning knob is a broken model. Full discussion in the "Local quants underperform" item below, with the verification numbers in the footnote at the end of the post.

(All API model scores in this table come from the Aider Polyglot leaderboard, last updated 2025-11-20. A few newer frontier releases (Google's Gemini 3 and Anthropic's Claude Opus 4.5 / 4.7) exist but haven't been scored by the Aider team yet, so they aren't represented above. The most recent Gemini and Opus variants the leaderboard does have are Gemini 2.5 Pro 32k-think at 83.1% and Claude Opus 4 32k-think at 72.0%.)

What this shows:

Sonnet-class is achievable locally, in both thinking and no-think modes. My best local model (qwen3.6-thinking, a 35B-A3B MoE) sits right in the Claude Sonnet thinking band (62.2% vs Sonnet 4 thinking 61.3%). And on the apples-to-apples no-think comparison, qwen3.6 with thinking off (53.8%) is just 2.6pp under Claude Sonnet 4 no-think (56.4%); effectively tied within Polyglot's noise floor. So it's not just "Sonnet-class when allowed to think"; it's "Sonnet-class without needing to think." That second result was the bigger surprise.
The recommendation has moved since my last post. Back then, qwen3-coder-next (80B-A3B) was the best local-fitting coding model I had, and the explicit subject of the previous post. qwen3.6 didn't exist yet. Now it does, and it's straightforwardly better: 53.8% Polyglot at thinking-off (vs qwen3-coder-next's 47.6%), 62.2% at thinking-on, smaller VRAM footprint, faster throughput. If you've been running qwen3-coder-next on Strix Halo since my last post: try qwen3.6.
The real gap is to GPT-5, Gemini 2.5 Pro, and Claude Opus. Those three are ~10-26pp ahead of my best local model. The Anthropic ladder is worth calling out specifically: qwen3.6-thinking (62.2%) is essentially tied with Sonnet 4 thinking (61.3%), but Anthropic's actual flagship is Opus, which scores 72.0%, about 10pp ahead of local. Then GPT-5 (88.0%) and Gemini 2.5 Pro thinking (83.1%) are the real top of the leaderboard. DeepSeek V3.2 Reasoner (74.2%) is the closest open-weight to that band, but at ~700B parameters it won't fit on either of my boxes.
Local quants underperform their API counterparts catastrophically on some models. My local gpt-oss-120B Q4_K_M scored 1.8%; the leaderboard's gpt-oss-120b (high) scored 41.8%. That's a 23x gap, not a small one. Three things contribute: quantization; the reasoning_effort parameter, which doesn't actually wire through to the model on llama.cpp (I verified this; low/medium/high produce near-identical outputs within sampling noise); and the edit format, where I used Aider's whole vs the leaderboard's diff. The reasoning-effort issue is probably the biggest factor: gpt-oss is essentially a reasoning model, and if the reasoning depth knob is broken, the model is operating in something close to a "low effort" mode regardless of what you pass in.
Thinking mode is meaningful when measured correctly. qwen3.6 without thinking: 53.8%. With thinking: 62.2%. That's 8.4 pp of capability sitting behind a flag.

A caveat about polyglot versioning

My test harness ran 225 exercises on most models, the same set as Aider's leaderboard. A few runs got 289 or 450 (multiple attempts per exercise from a config tweak); rates are still computed as passed/total. Edit-format matters too: I used whole because it's more robust to weaker models, while Aider's leaderboard uses diff because it gets better scores from the top models. whole is generally a slight handicap. Treat the comparisons as directional, not exact.

Methodology note: all HumanEval+ numbers in this post come from evalplus.codegen, the canonical scorer behind EvalPlus's published leaderboard.

Saturated benchmarks: HumanEval+ and lm_eval

These are the benches where frontier and local models all score in the same 85-95% range, so they don't discriminate well anymore. Quick look:

Cross-host gsm8k + ifeval, identical Q4_K_M quantization, identical chat-completions API, 200 items each:

Model	Strix Halo gsm8k	A30 gsm8k	Strix Halo ifeval	A30 ifeval
gemma-3-4b-it	78.5%	82.5%	69.5%	68.0%
qwen3-4b-2507	90.0%	90.0%	79.5%	80.5%
gemma-3-12b-it	92.5%	91.0%	72.0%	72.5%
gemma-3-27b-it	93.5%	93.5%	77.5%	76.0%
qwen3-30b-a3b-2507	95.5%	93.5%	79.5%	80.5%
qwen2.5-coder-32b	95.0%	95.0%	75.0%	75.0%
phi-4 (14B)	90.5%	91.0%	56.5%	57.0%
mistral-small-3.2-24b	95.0%	94.5%	76.5%	72.0%
qwen3.6-thinking	96.5%	n/a (model not on A30)	79.0%	n/a
gpt-oss-20b (reasoning off)	87.5%	n/a (different run config)	25.5%	n/a

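Most cross-host rows agree within 1-2 percentage points. The two outliers are gemma-3-4b-it on gsm8k (78.5% vs 82.5%, a 4.0pp gap) and mistral-small-3.2-24b on ifeval (76.5% vs 72.0%, 4.5pp), and on 200-item subsets a gap that size is still only about one standard error of the difference. Same model, same quantization, same prompt, same score within noise.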

Notable observations:

qwen3.6-thinking tops gsm8k at 96.5%, better than any A30 result in this set, and in the same ~95-97% band frontier models hit on saturated math benches before the frontier moved on to AIME / FrontierMath. On a 35B-A3B MoE running on a miniPC.
gpt-oss-20b ifeval at 25.5% is shockingly low for a model that hits 87.5% on gsm8k in the same run. This is the --reasoning off configuration. The other gpt-oss-20b runs in my data, reasoning on variants, also fall in the 25-36% ifeval band, so this isn't a reasoning-flag artifact; gpt-oss-20b just struggles with strict prompt-following regardless. Worth knowing if you were planning to deploy it for instruction-bound tasks.
phi-4 inverted profile: 90.5% gsm8k but only 56.5% ifeval. It's a math-strong, instruction-weaker model. Useful data point for choosing models by use case.

Same story as HumanEval+: local models are basically tied with frontier on saturated benchmarks. A Qwen3-30B-A3B on a miniPC scores 95.5% on gsm8k, comfortably in the same band as any frontier model that's still being measured against gsm8k. The frontier moat still exists, but it's on real-world agentic coding (Polyglot).

Quality at depth: NIAH

Throughput at 65K context is meaningless if the model can't actually find anything at 65K. I tested needle-in-haystack retrieval (single-needle: ask "what's the best thing to do in San Francisco?" with a sandwich-and-Dolores-Park needle planted at 10%, 50%, or 90% depth in a haystack of Paul Graham essays):

Model	Host	Pass rate (4K/16K/32K/60K × 3 depths)
qwen3.6-thinking	Strix Halo	100% (12/12)
qwen3-coder-next	Strix Halo	100% (12/12)
qwen3-30b-a3b-2507	A30	100% (12/12)
qwen2.5-coder-32b	A30	100% (12/12)
gemma-3-27b-it	A30	100% (12/12)
phi-4	A30	100% (12/12)
mistral-small-3.2-24b	A30	100% (12/12)
llama-4-scout-17b-16e	A30	100% (12/12)
qwen3-4b-2507	A30	100% (12/12)
gpt-oss-20b	A30	91.7% (11/12)
granite-3.1-8b-instruct	A30	88.9% (failed at depth)
deepseek-coder-v2-lite	A30	33.3% (4/12, all 4K passes, 1 of 3 at 16K, every 32K and 60K cell timed out at 600s), same root cause as the llama-bench MLA cliff: CUDA-on-MLA is too slow at depth to finish a single query inside any reasonable budget

Top-row finding: Strix Halo's qwen3.6-thinking and qwen3-coder-next both score perfect retrieval at 60K context, with response times of 1-2 min per query. The model isn't just running with that context, it's actually using it. Combined with the throughput numbers, this is what makes the miniPC a real coding-agent target rather than a benchmark curiosity.
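
For anyone who wants to reproduce the single-needle setup, the core of it is just planting one sentence at a fractional depth in the filler text. This is a minimal sketch, not my actual harness: the function name is hypothetical, it splits on whitespace rather than tokens, and the needle sentence is the stock one from the public needle-in-a-haystack test, which I'm assuming matches the wording described above.

```python
def plant_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end).

    Works on whitespace-separated words, so original line breaks are
    collapsed to single spaces. Good enough for a sketch; a real harness
    would count model tokens instead of words.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    words = haystack.split()
    cut = round(len(words) * depth)
    return " ".join(words[:cut] + [needle] + words[cut:])


# Assumed needle wording (the stock one from the public NIAH test).
NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = "What's the best thing to do in San Francisco?"

filler = "essay text goes here " * 1000  # stand-in for the essay corpus
prompts = {d: plant_needle(filler, NEEDLE, d) + "\n\n" + QUESTION
           for d in (0.1, 0.5, 0.9)}
```

Scoring is then a string check on the reply (does it mention the sandwich and Dolores Park), repeated for each context length and depth.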

Speed: default throughput

Quality matters; speed matters more than people think. A 62% model running at 1 tok/s is unusable. A 50% model at 80 tok/s is a daily driver.

(Methodology note before the tables: every Strix Halo throughput number below was collected with no other model servers running, fans pinned to max, and free memory verified before each run. There's a bench wrapper now that refuses to start without those conditions met. I ended up writing it after melting the poor machine twice, details in What I'd do differently below.)

Default pp512 / tg128 numbers (Q4_K_M, -fa 1, Strix Halo on q8_0 KV / A30 on q4_0 KV, see longctx section for the protocol note). Throughput is in tokens/sec, so higher is better. The last column is the one place bigger isn't better: it's the A30/miniPC tg ratio, where above 1.0 means the A30 is faster and below 1.0 means the miniPC wins (I flag those rows inline).

Model	Strix Halo pp	A30 pp	Strix Halo tg	A30 tg	A30/Strix Halo tg ratio
qwen3-4b-2507	2048	3934	75.0	118.9	1.59x
gemma-3-4b-it	2257	4431	74.1	109.5	1.48x
gemma-3-12b-it	750	1590	26.9	52.4	1.95x
phi-4 (14B)	652	1452	24.1	53.2	2.21x
gpt-oss-20b	1287	2805	80.8	130.2	1.61x
mistral-small-3.2-24b	267	905	15.3	34.5	2.26x
gemma-3-27b-it	230	771	12.6	28.0	2.23x
qwen3-30b-a3b-2507 (MoE)	1167	2274	87.0	136.2	1.56x
qwen2.5-coder-32b	186	633	11.1	24.1	2.18x
qwen3-coder-next (80B-A3B)†	551	110	56.4	12.2	0.22x ← Strix Halo wins 4.6x
qwen3.6 (35B-A3B MoE)	944	1933	67.1	99.9	1.49x

† The A30 row for qwen3-coder-next is hybrid GPU/CPU offload (22 of 49 layers on GPU, the rest on CPU/RAM). The 45 GiB Q4_K_M model can't fit fully in 24 GiB VRAM, so this is what you get if you force it onto the A30 anyway, the apples-to-apples speed cost of exceeding the VRAM ceiling on a dedicated GPU.

Two stories here:

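1. A30 wins at default, by roughly 1.5-2.3x on tg for every model that fits in its VRAM. Expected: a dedicated GPU with proper VRAM and CUDA kernels should beat an APU running Vulkan. The tg factor is consistent across the dense models from 12B up, in the 1.95-2.26x range; the pp gap on those same models is wider, at 2.1-3.4x.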

2. MoE narrows the gap and makes the miniPC viable. Look at qwen3-30b-a3b-2507: A30/Strix Halo ratio is just 1.56x for tg, one of the smallest gaps in the table among the bigger models (only qwen3.6, another A3B MoE, is lower at 1.49x). That's because the model only activates ~3B params per token. Memory bandwidth matters more than raw compute for tg, and Strix Halo's UMA gives it surprisingly good bandwidth for active-parameter-light workloads. (The 4B models also show ratios below 2x: small models stop benefiting from the A30's compute headroom because they're already bandwidth-bound on both boxes.)

Compare that to the dense qwen2.5-coder-32b: 11.1 tok/s on Strix Halo vs 24.1 on A30, still a 2.18x gap but the absolute number is terrible on Strix Halo. I don't know about the rest of you, but 11 tok/s on a 32B dense model is not exactly what I'd call "usable". I'd never reach for the dense coder if a comparable-quality MoE exists.

Speed at long context

Now the fun part. Wait, I already said that. Another fun part! Coding agents send long context (the codebase, the test results, previous turns), so what happens when you push the depth?

I ran the same pp512 / tg128 test at depths 0 / 8K / 32K / 65K. Strix Halo is benched with q8_0 KV cache (matches how the production llama-servers are deployed). A30's previously-collected longctx sweep was at q4_0 KV; the small protocol asymmetry is mildly conservative for Strix Halo at depth (q4_0 saves a bit of KV bandwidth at the cost of dequant overhead, within MC noise on this hardware, but if anything it shaves a few percent off the Strix Halo side at deep contexts).
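
For reference, a depth sweep like the one described above can be expressed as a single llama-bench invocation. The helper below is a hypothetical sketch that only builds the argument list. The model path is a placeholder, and the exact token counts behind "8K / 32K / 65K" are my assumption here (powers of two); `-p`, `-n`, `-d`, `-fa`, `-ctk` and `-ctv` are llama-bench options, but check `llama-bench --help` on your build since flags change between versions.

```python
import shlex


def llama_bench_cmd(model_path, depths=(0, 8192, 32768, 65536), kv_type="q8_0"):
    """Build a llama-bench argv for a pp512 / tg128 sweep over KV depths.

    depths: assumed token counts for the 0 / 8K / 32K / 65K points.
    kv_type: KV cache quantization (q8_0 on Strix Halo, q4_0 on the A30
             in the runs described above).
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-p", "512",                              # prompt-processing test size
        "-n", "128",                              # token-generation test size
        "-d", ",".join(str(d) for d in depths),   # KV depth(s) to prefill
        "-fa", "1",                               # flash attention on
        "-ctk", kv_type,                          # K cache type
        "-ctv", kv_type,                          # V cache type
    ]


# "model.gguf" is a placeholder path
print(shlex.join(llama_bench_cmd("model.gguf")))
```

Passing `kv_type="q4_0"` gives the A30-side variant of the same sweep.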

gpt-oss-20b (fits both)

Depth	Strix Halo pp	A30 pp	Strix Halo tg	A30 tg	A30/Strix Halo tg
0 (default)	1287	2805	80.8	130.2	1.61x
8K	958	2522	66.6	109.5	1.64x
32K	547	1933	56.9	77.2	1.36x
65K	338	1452	45.6	54.5	1.20x

A30 tg dropped 58% from default to 65K depth (130 to 55 tok/s). Strix Halo tg dropped 44% over the same range (81 to 46 tok/s). A30 still wins on this model at every depth, but the lead shrinks dramatically as context grows, the A30/Strix Halo ratio compresses from 1.61x at default to 1.20x at 65K.
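
The drop and ratio figures quoted for these sweeps are plain arithmetic on the tg columns. A small sketch, using the gpt-oss-20b rows from the table above (the function name is just for illustration):

```python
# (depth label, Strix Halo tg, A30 tg) from the gpt-oss-20b table
SWEEP = [
    ("0", 80.8, 130.2),
    ("8K", 66.6, 109.5),
    ("32K", 56.9, 77.2),
    ("65K", 45.6, 54.5),
]


def summarize(sweep):
    """Per depth: A30/Strix tg ratio and each box's tg drop vs the first row."""
    _, strix0, a30_0 = sweep[0]
    rows = []
    for depth, strix, a30 in sweep:
        rows.append({
            "depth": depth,
            "ratio": a30 / strix,             # above 1.0 means the A30 is faster
            "strix_drop": 1 - strix / strix0,
            "a30_drop": 1 - a30 / a30_0,
        })
    return rows


for r in summarize(SWEEP):
    winner = "A30" if r["ratio"] > 1 else "Strix Halo"
    print(f'{r["depth"]:>4}  ratio {r["ratio"]:.2f}x ({winner})  '
          f'Strix drop {r["strix_drop"]:.0%}  A30 drop {r["a30_drop"]:.0%}')
```

Swapping in the rows from any of the other depth tables gives the corresponding drops and crossover points.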

qwen3-30b-a3b-2507 (fits both)

Depth	Strix Halo pp	A30 pp	Strix Halo tg	A30 tg	A30/Strix Halo tg
0 (default)	1167	2274	87.0	136.2	1.56x
8K	533	1746	62.1	72.5	1.17x
32K	205	1012	40.2	35.1	0.87x ← Strix Halo wins
65K	110	631	28.1	20.6	0.73x ← Strix Halo wins by 36%

This is where it gets spicy. A30 tg dropped 85% from default to 65K (136 to 21 tok/s) as the 24 GiB of VRAM ran out of room for a meaningful KV cache at depth. Strix Halo tg dropped 68% over the same range (87 to 28 tok/s), painful but consistent. Crossover happens between 8K and 32K context. At 32K the miniPC is already faster; at 65K it's 36% faster than the dedicated GPU.

The model itself is 17 GiB Q4_K_M. The A30 has 24 GiB of VRAM. At 65K context the KV cache plus activations plus the model are competing for that 7 GiB headroom, and CUDA's memory management gets bottlenecked. Strix Halo's 128 GiB UMA doesn't care, there's so much memory headroom that the only constraint is compute and bandwidth, both of which degrade gracefully.
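
If you want to sanity-check headroom for your own model before loading it, the KV cache size can be estimated from the attention shape. This is a back-of-envelope sketch, not something the bench measured: the layer count, KV-head count and head size below are assumed values that I believe match qwen3-30b-a3b (verify against the GGUF metadata), and it ignores compute buffers and any per-backend overhead.

```python
# Bytes per cached element. ggml's q8_0 and q4_0 formats pack 32 values
# per block with a 2-byte scale: 34 and 18 bytes per block respectively.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimated KV cache size in GiB for a standard (GQA/MHA) attention model."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


# Assumed architecture values (check your model's metadata):
ARCH = dict(n_layers=48, n_kv_heads=4, head_dim=128)

for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_gib(n_ctx=65536, cache_type=cache_type, **ARCH)
    print(f"{cache_type:>5}: {gib:.2f} GiB of KV at 65536 tokens")
```

Whatever the totals come out to for a given model, the KV term is the only one that grows with context, so it's the number to compare against the roughly 7 GiB left over after the weights.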

qwen3.6-thinking (Strix Halo only, Sonnet-tier model)

This is the model I'd actually use for coding. The numbers are remarkable:

Depth	Strix Halo pp	Strix Halo tg
0 (default)	944	67.1
8K	790	61.8
32K	517	55.6
65K	349	45.5

tg drops 32% from default to 65K depth (67.1 to 45.5 tok/s). A Sonnet-class model running locally at 45 tok/s with a 65K-token context window. That's actually usable for serious agentic coding, you can pack a meaningful chunk of a codebase into the context and not pay a brutal speed tax for it.

A note on Q8_0: I also ran the no-think qwen3.6 at Q8_0 (38 GiB on disk vs Q4_K_M's 20 GiB). Polyglot moved from 53.8% to 56.9%, a ~3 pp gain. Throughput dropped from 65 tok/s to 50 tok/s at default and is similarly proportional at depth. So if you have the disk and want every last point of Polyglot, Q8_0 is a real upgrade. If you'd rather have the speed, Q4_K_M is the right call, the quality gap is small relative to the speed cost.

qwen3-coder-next 80B-A3B (Strix Halo UMA vs A30 hybrid offload)

The 80B-A3B that motivated the last post. At 45 GiB Q4_K_M it doesn't fit in 24 GiB VRAM, so the A30 column here is hybrid GPU/CPU offload (-ngl 22, 22 of 49 layers on GPU, the rest streamed from system RAM). Strix Halo's 128 GiB UMA swallows the full model and runs entirely on the iGPU:

Depth	Strix Halo pp	A30 hybrid pp	Strix Halo tg	A30 hybrid tg	A30/Strix Halo tg
0 (default)	551	110	56.4	12.2	0.22x ← Strix Halo wins 4.6x
8K	500	109	52.5	9.3	0.18x ← Strix Halo wins 5.6x
32K	372	109	46.9	5.4	0.12x ← Strix Halo wins 8.7x
65K	256	106	38.1	3.9	0.10x ← Strix Halo wins 9.8x

This is the clearest "wrong tool for the job" result I had. The A30 is a good card, it just doesn't have enough VRAM to hold the model, and PCIe bandwidth between GPU and host RAM is roughly 30x slower than the A30's own HBM2. So every token has to drag activations across that bottleneck.

The math: A30 hybrid tg falls from 12.2 to 3.9 tok/s (a 68% drop) over the depth sweep, while Strix Halo's UMA tg falls from 56.4 to 38.1 (only 32%). The A30 falls off twice as steeply because attention has to read the full KV cache to produce each new token, and on hybrid mode roughly half the model's layers, plus their slice of the KV cache, live in CPU RAM (DDR4, on this server). Each token's attention op pays PCIe-bandwidth overhead, and that overhead scales with context length. So 4.6× at default and 9.8× at 65K.

On the Strix system the story is the other way around: the iGPU has the same bandwidth to all 128 GiB as it does to the first 16 GiB. There's no VRAM cliff to fall off because there's no VRAM/RAM distinction at all. tg drops 32% from default to 65K (56.4 to 38.1 tok/s), painful but consistent, and at 38 tok/s with 65K of context loaded it's still... not fast, but usable.

(I also tried to run Aider Polyglot on A30 hybrid for a quality cross-check; the harness's per-call timeout repeatedly fired against the 3.9-9.3 tok/s hybrid response rate, and I abandoned the run after 9 of 225 exercises in ~5 hours. Throughput data above is from llama-bench directly, which doesn't have that problem.)

DeepSeek-Coder-V2-Lite, the bonus weird result

I benchmarked this one for completeness, expecting nothing exciting. Instead I found one of the clearest "the dedicated GPU is broken here" results in the whole sweep. DeepSeek-V2's Multi-head Latent Attention (MLA) uses a low-rank-projected KV cache that's smaller than standard MHA but requires a different attention kernel. The CUDA implementation in llama.cpp build 9064 falls off a cliff once any KV is present:

Depth	Strix Halo pp (Vulkan)	A30 pp (CUDA)	Strix Halo tg	A30 tg	A30/Strix Halo tg
0 (default)	1641	408	106.0	88.5	0.83x ← Strix Halo wins
8K	1032	17	64.4	4.9	0.08x ← Strix Halo wins 13x
32K	484	wedged	30.8	wedged	n/a
65K	250	wedged	17.4	wedged	n/a

The A30 bench actually wedged my harness: at d=32K, the CUDA kernel grinds at ~3-5 tok/s prefill, which means a single measurement of the 32K-token prefill would take 100+ minutes. I killed it after 17 minutes of no progress.

Strix Halo's Vulkan path handles MLA at depth normally, degrading from 106 to 17 tok/s tg is a real cliff, but it's a finite one and the bench actually finishes. Even at d=0 Strix Halo is 4× faster on pp512 (1641 vs 408), and that's before any KV is in play. The CUDA backend isn't just slow at depth on this architecture, it's just slow on this architecture.

This isn't a hardware issue, I don't think, it's a software bug. Presumably to be fixed in some future llama.cpp release lol. But for anyone considering DeepSeek-V2-family models for coding right now the miniPC is the only sensible target. A 24 GiB A30 will load the model just fine and then be fairly unusable.

Bonus models (Strix Halo coverage only)

For completeness, two more models I benchmarked on Strix Halo to fill out the table:

Model	Default pp	Default tg	65K pp	65K tg	Note
granite-3.1-8b-instruct	996	39.5	(crashed)	(crashed)	Vulkan device-lost at d=65K, got d=0/8K/32K only
llama-4-scout-17b-16e	159	20.1	105	13.9	17B-active, 109B-total, slowest in the post but flattest depth scaling (only 31% tg drop)

What each box is actually best for

Strix Halo as a coding agent: qwen3.6 with thinking on when I want quality, qwen3.6 with thinking off when I want speed/quality balance. Same model file, same throughput, just flip the --reasoning flag.
A30 for serving small concurrent requests: gpt-oss-20b at 130 tok/s or qwen3-30b-a3b at 136 tok/s is great for embeddings, rerank, and utility models in a stack.

These are different jobs. The boxes aren't substitutes; they're complements.

What I'd do differently

Update everything to the latest first. I spent a week chasing scores that looked too low only to realize my llama.cpp was 700 commits behind on reasoning-channel handling. Thinking models scored 0% on lm_eval because the reasoning content was consuming the entire context budget. A rebuild fixed it. This stuff moves fast, llama.cpp lands fixes weekly, so pull and rebuild to the latest before you trust a single number.
Bench with -d from the start, not -c. The -c arg got removed from llama-bench in recent builds; the replacement is -d for testing tg at a given KV depth. My first A30 long-context sweep died at parse time. Trivial fix in retrospect, but it cost me half a day.
Don't trust HumanEval+ as a discriminator. Everything competent scores 85%+. The bench doesn't separate "okay" from "great." Polyglot is what actually matters; I should have run it first.
Run whole and diff edit formats both. I ran everything in whole because it's robust for weak models. That makes the strong-model comparisons against Aider's leaderboard (which uses diff) slightly unfair to the local models. Doing both would have given a cleaner local-vs-API comparison.
Treat thermals and bench cleanliness as first-class concerns. Two specific traps cost me roughly a week of redo work:
- Don't re-make the same thermal mistakes as last time. I already worked this box's thermals out in the last post: sustained GPU load trips it unless you cap power with ryzenadj and pin the fans manually, because the stock fan curve is tuned for desktop bursts, not back-to-back benchmarks holding the GPU near 100% for minutes at a time. Then I forgot to actually turn any of that on before kicking off a multi-hour sweep, and crashed the box twice (no kernel log, just unreachable until a power-cycle) rediscovering a lesson I'd already written down. The fix was the one I already had on the shelf: mode=fixed level=5 on all three fans (under /sys/class/ec_su_axb35/fan*/) before any sustained workload. The wrapper now refuses to start a bench unless the fans are confirmed above 3500 RPM.
- Keep other model servers cleared out the whole time, not just at the start. Any concurrent llama-server process --mlock's its model into RAM and steals memory bandwidth from the bench. I caught this when a spot-check tg128 re-run came in 5% higher than the recorded number with everything else stopped. Five percent is small enough to miss in a single run and big enough to materially change rankings across models. The real trap is that it's easy to start clean and then let stray servers creep back in over a long session, so the fix isn't a one-time cleanup, it's re-verifying nothing else is loaded before every single run. Every Strix Halo throughput number in this post was collected that way, and the wrapper enforces it as a precondition.
The meta lesson: a bench harness that requires you to remember the discipline will eventually run dirty. Make the harness refuse to run unless the conditions are met.
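
To make that concrete, here is a minimal sketch of a "refuse to run dirty" gate. It is not my actual wrapper: the names are hypothetical, and how the fan speeds, process list and free memory get read is left to the caller on purpose. The only number taken from above is the 3500 RPM threshold.

```python
MIN_FAN_RPM = 3500  # threshold quoted above


class DirtyBenchError(RuntimeError):
    """Raised when the machine is not in a clean state for benchmarking."""


def preflight(fan_rpms, other_server_pids, free_gib, needed_gib):
    """Raise DirtyBenchError unless every precondition holds.

    All inputs are measured by the caller; reading them (sysfs, ps,
    /proc/meminfo, ...) is deliberately outside this sketch.
    """
    problems = []
    slow = [rpm for rpm in fan_rpms if rpm <= MIN_FAN_RPM]
    if slow:
        problems.append(f"fans not above {MIN_FAN_RPM} RPM: {slow}")
    if other_server_pids:
        problems.append(f"other model servers running: {sorted(other_server_pids)}")
    if free_gib < needed_gib:
        problems.append(f"only {free_gib:.1f} GiB free, need {needed_gib:.1f}")
    if problems:
        raise DirtyBenchError("; ".join(problems))


def run_bench(read_state, bench):
    """Re-check machine state immediately before every single run."""
    preflight(**read_state())
    return bench()
```

The design point is that `run_bench` calls `preflight` every time, so a stray server that creeps back in mid-session stops the next run instead of quietly skewing it.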

The end result

Use the right tool for the job. Shocking, I know.

The miniPC can be a Sonnet-tier coding agent (when running the right model) that costs about $2,500 once and never sends my code anywhere. The A30 box is for smaller task-specific models that need high throughput.

The local-vs-frontier gap is still real on the hardest problems and on real agentic Polyglot work, but it's roughly Sonnet-class for daily-driver coding tasks, and the gap is closing. The next time someone benchmarks this, I expect the frontier-API moat to be at least a little bit smaller.

*Footnote: the reasoning_effort parameter not wiring through to llama.cpp's gpt-oss path is documented elsewhere; I verified by running effort=low/medium/high through lm_eval gsm8k and getting near-identical scores (90% / 86% / 86%), within the sampling noise band for a 200-item subset. If the flag were actually doing anything, I'd expect monotonic improvement from low to high; instead "high" is the same as "medium" and "low" comes out higher than both, which only makes sense if all three are effectively the same configuration plus sampling noise. A separate post about this might be coming.*