Strix Halo vs an A30 vs the Frontier: What the miniPC Can (and Can't) Actually Do
A few months ago I wrote about getting a Strix Halo miniPC (~$2.5K all-in) to run a frontier coding model. That led to people asking me how I was measuring/testing models (I really wasn't, beyond what my anecdotal experience was). That led to me starting to actually benchmark stuff - but then I thought... why not throw my A30 GPU in the mix while I'm at it, and really see what's what? They even cost me (roughly) the same.
The short version: I didn't lie! It is roughly like Sonnet when it's a bit slow. BUT, things have changed and now you should use qwen3.6. Also there are some other neat findings ahead ;)
The longer version: quality is hardware-agnostic, which is exactly what you'd expect (run the same model on a Strix Halo APU or a datacenter A30 and you get the same benchmark score within noise, because the hardware should decide how fast, not how smart). The interesting questions were always about speed and fit, and that is where the miniPC shines!
Locally-hosted qwen3.6-thinking (a 35B-A3B Mixture-of-Experts (MoE) model, that is, 35B total parameters with only ~3B activated per token) scores 62.2% on the Aider Polyglot benchmark, sitting right between Claude Sonnet 4 thinking (61.3%) and Claude 3.7 Sonnet thinking (64.9%), and it does that at 45 tok/s with a 65K-token context. That is Sonnet-class capability, at usable speeds, on a ~$2.5K box. It is not Opus-class (72.0%) or GPT-5-class (88.0%), but Sonnet-class coding on hardware I own, at no per-token cost and with nothing leaving my network.
The surprises were both about memory, not compute. First: even when a model fits comfortably in the A30's VRAM, the A30 can still lose at long context. The A30 wins at short context, but on qwen3-30b-a3b at 65K depth the miniPC outruns it by 36% (28.1 vs 20.6 tok/s), because the A30's 24 GiB runs out of room for the KV cache (the attention key/value store that grows with every token of context) while the miniPC's 128 GiB of unified memory shrugs it off (the crossover lands somewhere between 8K and 32K). qwen3-30b-a3b is only 17 GiB, so this isn't even an offload problem: it's all running on the GPU, and the A30 just can't hold a big model and a big KV cache at once.
The bigger surprise was the hybrid cliff. For any model that doesn't fit in 24 GiB at all, the miniPC wins outright. qwen3-coder-next (80B-A3B) has to be split across GPU and CPU on the A30, and once half its layers (plus their KV) live in slow DDR4 system memory and get streamed over PCIe every token, it crawls: 3.9 tok/s at 65K against Strix Halo's 38, almost 10x. That gap, a unified-memory box versus a GPU forced into hybrid offload, is the whole reason I bought the thing.
The recommendation has also moved since the last post. Back then, qwen3-coder-next was the best local-fitting coding model I had, and I called its quality "something like Sonnet 4.5 on a slow day." That was ... close enough that I won't say I was wrong, hah (it lands about 9pp below baseline Sonnet 4), but qwen3.6 has since released, and on the same Strix Halo hardware it is straightforwardly better: higher quality, faster, and a smaller VRAM footprint. If you have been running qwen3-coder-next since my last post, switch to qwen3.6.
Contents
- The contestants
- The benchmarks
- Quality: Strix Halo vs A30
- Quality: Local vs the Frontier
- Coding-specific: Aider Polyglot
- Saturated benchmarks: HumanEval+ and lm_eval
- Quality at depth: NIAH
- Speed: default throughput
- Speed at long context
- gpt-oss-20b (fits both)
- qwen3-30b-a3b-2507 (fits both)
- qwen3.6-thinking (Strix Halo only, Sonnet-tier model)
- qwen3-coder-next 80B-A3B (Strix Halo UMA vs A30 hybrid offload)
- DeepSeek-Coder-V2-Lite, the bonus weird result
- Bonus models (Strix Halo coverage only)
- What each box is actually best for
- What I'd do differently
- The end result
The contestants
| Spec | Strix Halo (the miniPC) | A30 |
|---|---|---|
| GPU/APU | AMD Ryzen AI MAX+ 395 (gfx1151) | NVIDIA A30 PCIe |
| Architecture | RDNA3.5-class iGPU, unified memory (UMA) | Ampere GA100, dedicated PCIe |
| Memory | 128 GiB LPDDR5 (shared CPU+GPU) | 24 GiB HBM2 |
| Backend | llama.cpp Vulkan (RADV) | llama.cpp CUDA |
| TDP (effective) | 100W (after ryzenadj) | 165W |
| Host system | GMKtec NucBox EVO-X2 | R740 / Proxmox VM (16c/64G) |
| Approx all-in cost | ~$2,500 | ~$2-3K used (card only; bring your own server) |
| Largest model that fits | gpt-oss-120B (~64 GiB), qwen3-coder-next 80B-A3B (~45 GiB) | qwen2.5-coder-32B (Q4_K_M) at most |
The "what fits" row is the most important and most under-discussed difference, with one clarification: it means fits entirely in VRAM, at full speed. You can push a bigger model onto the A30 with hybrid GPU/CPU offload (I do exactly that later in the post), it just runs much slower once part of it spills out of the 24 GiB. The miniPC's unified memory has no such cliff: anything up to 128 GiB loads and runs on the iGPU at full speed, including a 120B model the A30 can't hold in VRAM at all.
It's worth noting what's doing the spilling on the A30 side: its host is an older Dell R740 on DDR4, so when a model overflows into system RAM, DDR4 bandwidth is the bottleneck. A newer DDR5 host with more memory channels would lift the hybrid numbers, but it would also cost more, which is sort of the point: the unified-memory miniPC sidesteps the spill entirely, for the price of a single mid-range box.
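To make the "what fits" cliff concrete, here is a back-of-the-envelope sketch of how many layers stay on the GPU once a model outgrows VRAM. It assumes every layer is the same size (not strictly true) and holds back a reserve for KV cache and compute buffers. The function name and the 3.5 GiB reserve are assumptions picked for illustration; this is not how llama.cpp decides anything.

```python
import math

def layers_that_fit(model_gib: float, n_layers: int, vram_gib: float, reserve_gib: float) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes all layers are the same size and that reserve_gib is held
    back for KV cache and compute buffers.
    """
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, math.floor(usable_gib / per_layer_gib))

# qwen3-coder-next: ~45 GiB across 49 layers, on a 24 GiB card.
# The 3.5 GiB reserve is an assumed figure, chosen for illustration.
print(layers_that_fit(45, 49, 24, 3.5))
# The same model against 128 GiB of unified memory: every layer fits.
print(layers_that_fit(45, 49, 128, 3.5))
```

With that assumed reserve the estimate lands on the same 22-of-49 split used for the A30 hybrid runs later in the post. A different reserve would move it by a layer or two.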
The benchmarks
Four things, each measuring something different:
- Aider Polyglot, 225 multi-language Exercism coding exercises, the model is asked to edit existing files to make tests pass. This is the only benchmark on the list that resembles real-world agentic coding work, and it's the one frontier models actually struggle with. Not saturated.
- HumanEval+, function-level code generation, 164 problems. Top models all score 90%+. Saturated.
- lm_eval (gsm8k, ifeval, mmlu_pro), knowledge and instruction-following at single-prompt level. Frontier models saturate this too.
- llama-bench, pure throughput, no quality signal. Two numbers matter, both in tokens/sec: pp (prompt processing, how fast the model ingests your prompt) and tg (token generation, how fast it writes the reply). I report the defaults pp512 / tg128 (a 512-token prompt, 128 generated tokens) plus depth tests for long-context behavior.
I treat Polyglot as the load-bearing quality metric because (a) it actually discriminates, and (b) it's what I care about, agentic coding is what these boxes get used for in practice. If there's a benchmark I didn't run, it is because I don't know about it or didn't think of it.
One reading convention for every table below: higher is better, unless I explicitly say otherwise.
Quality: Strix Halo vs A30
Nobody expects a model to get smarter or dumber depending on whether it runs on AMD or NVIDIA silicon, and it doesn't. This was never really an open question. But I had both boxes and was running the benchmarks anyway, so I figured I'd confirm it and see where any real differences showed up. The answer: quality is the same within noise, with one model-specific surprise. Boring but necessary setup for everything that follows.
| Model | Strix Halo (Vulkan) | A30 (CUDA) |
|---|---|---|
| gemma-3-27b-it (HumanEval+ p@1+) | 78.7% | 77.4% |
| qwen2.5-coder-32b (HumanEval+ p@1+) | 85.4% | 86.6% |
| qwen3-30b-a3b-2507 (HumanEval+ p@1+) | 89.0% | 89.6% |
| qwen2.5-coder-32b (Polyglot) | 25/225 (11.1%) | 25/225 (11.1%) |
| qwen3.6 (Polyglot, no think) | 121/225 (53.8%) | 106/225 (47.1%) |
("p@1+" is pass@1 on EvalPlus's extended test set, meaning the model's first answer has to pass every test.)
Within noise on HumanEval+ across the board, and on the qwen2.5-coder Polyglot row. The qwen3.6 Polyglot row shows a 6.7pp cross-host gap (53.8% Strix Halo vs 47.1% A30), which is larger than I'd expect from pure sampling noise; possibly a real CUDA-vs-Vulkan difference for that specific model and harness, or a build-version skew between the two boxes. The HumanEval+ gemma/qwen2.5/qwen3-30b rows on the same model files agree cross-host to within about 1.3pp, with neither box consistently ahead, so it isn't a general "the A30 produces worse logits" pattern; it's a qwen3.6-Polyglot-specific finding I'd want to dig into in a future bench.
So the model is mostly the model, and hardware doesn't make it dumber in any general sense. There can be model-specific cross-host quirks worth checking (this one came as a surprise to me), but for the typical case, once you've picked a model that fits, the hardware question reduces to how fast and can it even fit.
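One rough way to size a cross-host gap against sampling noise is a two-proportion z-score. The sketch below treats the two runs as independent samples, which is a simplification: both hosts ran the same 225 exercises, so a paired test over per-exercise outcomes would be the stricter check. The function name is illustrative.

```python
import math

def two_proportion_z(pass_a: int, pass_b: int, n: int) -> float:
    """z-score for the difference between two pass rates on n items each.

    Treats the two runs as independent samples, which ignores that both
    hosts ran the same exercises (a paired test would be stricter).
    """
    p_a, p_b = pass_a / n, pass_b / n
    pooled = (pass_a + pass_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * (2 / n))
    return (p_a - p_b) / std_err

# qwen3.6 no-think Polyglot: 121/225 on Strix Halo vs 106/225 on the A30.
z = two_proportion_z(121, 106, 225)
print(f"gap = {(121 - 106) / 225:.1%}, z = {z:.2f}")
```

A z well above 2 would point to a real cross-host difference; a value nearer 1 says more runs are needed before blaming the backend.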
Quality: Local vs the Frontier
Here's where it gets fun. I'm going to split this into coding and non-coding because they behave very differently.
Coding-specific: Aider Polyglot
Polyglot is the benchmark where frontier models still have headroom, and the one that tracks "how good is this thing as a coding agent." Here's the comparison (Aider leaderboard scores for the API models, my results for local):
| Model | Polyglot pass rate | Notes |
|---|---|---|
| GPT-5 (high) | 88.0% | API |
| Gemini-2.5-Pro (32k think) | 83.1% | API |
| DeepSeek-V3.2 Reasoner | 74.2% | API (open weight ~700B, won't fit my hardware) |
| Claude Opus 4 (32k think) | 72.0% | API |
| Claude Opus 4 (no think) | 70.7% | API |
| Claude 3.7 Sonnet (32k think) | 64.9% | API |
| qwen3.6-thinking (Strix Halo) | 62.2% | local, 35B-A3B MoE |
| Claude Sonnet 4 (32k think) | 61.3% | API |
| Claude 3.7 Sonnet (no think) | 60.4% | API |
| Claude Sonnet 4 (no think) | 56.4% | API |
| qwen3.6 (Strix Halo, no think) | 53.8% | local |
| qwen3-coder-next (Strix Halo) | 47.6% | local, 80B-A3B MoE; doesn't fit on A30, see the Speed sections |
| qwen3.6 (A30, no think) | 47.1% | local |
| GPT-OSS-120B (high) | 41.8% | leaderboard score, API |
| qwen3-30b-a3b-2507 (Strix Halo) | 30.2% | local |
| qwen3-30b-a3b-2507 (A30) | 28.9% | local |
| GPT-OSS-20B-thinking (A30) | 16.9% | local |
| GPT-OSS-120B (Strix Halo, Q4_K_M) | 1.8%* | local, almost certainly broken locally, see footnote |
* The 23× gap between local gpt-oss-120B (1.8%) and the same model's API leaderboard score (41.8%) is almost certainly the reasoning_effort parameter not wiring through to llama.cpp's gpt-oss path: low/medium/high produce near-identical outputs within sampling noise. For a model whose top-line capability is its reasoning depth, a broken reasoning knob is a broken model. Full discussion in the fourth point of the list below.
(All API model scores in this table come from the Aider Polyglot leaderboard, last updated 2025-11-20. A few newer frontier releases (Google's Gemini 3 and Anthropic's Claude Opus 4.5 / 4.7) exist but haven't been scored by the Aider team yet, so they aren't represented above. The most recent Gemini and Opus variants the leaderboard does have are Gemini 2.5 Pro 32k-think at 83.1% and Claude Opus 4 32k-think at 72.0%.)
What this shows:
- Sonnet-class is achievable locally, in both thinking and no-think modes. My best local model (qwen3.6-thinking, a 35B-A3B MoE) sits right in the Claude Sonnet thinking band (62.2% vs Sonnet 4 thinking 61.3%). And on the apples-to-apples no-think comparison, qwen3.6 with thinking off (53.8%) is just 2.6pp under Claude Sonnet 4 no-think (56.4%), effectively tied within Polyglot's noise floor. So it's not just "Sonnet-class when allowed to think"; it's "Sonnet-class without needing to think." That second result was the bigger surprise.
- The recommendation has moved since my last post. Back then, qwen3-coder-next (80B-A3B) was the best local-fitting coding model I had, and the explicit subject of the previous post. qwen3.6 didn't exist yet. Now it does, and it's straightforwardly better: 53.8% Polyglot at thinking-off (vs qwen3-coder-next's 47.6%), 62.2% at thinking-on, smaller VRAM footprint, faster throughput. If you've been running qwen3-coder-next on Strix Halo since my last post: try qwen3.6.
- The real gap is to GPT-5, Gemini 2.5 Pro, and Claude Opus. Those three are ~10-26pp ahead of my best local model. The Anthropic ladder is worth calling out specifically: qwen3.6-thinking (62.2%) is essentially tied with Sonnet 4 thinking (61.3%), but Anthropic's actual flagship is Opus, which scores 72.0%, about 10pp ahead of local. Then GPT-5 (88.0%) and Gemini 2.5 Pro thinking (83.1%) are the real top of the leaderboard. DeepSeek V3.2 Reasoner (74.2%) is the closest open-weight to that band, but at ~700B parameters it won't fit on either of my boxes.
- Local quants underperform their API counterparts catastrophically on some models. My local gpt-oss-120B Q4_K_M scored 1.8%; the leaderboard's gpt-oss-120b (high) scored 41.8%. That's a 23x gap, not a small one. Three things contribute: quantization; the reasoning_effort parameter not actually wiring through to the model on llama.cpp (I verified this; low/medium/high produce near-identical outputs within sampling noise); and my use of Aider's whole edit format vs the leaderboard's diff. The reasoning-effort issue is probably the biggest factor; gpt-oss is essentially a reasoning model, and if the reasoning depth knob is broken, the model is operating in something close to a "low effort" mode regardless of what you pass in.
- Thinking mode is meaningful when measured correctly. qwen3.6 without thinking: 53.8%. With thinking: 62.2%. That's 8.4 pp of capability sitting behind a flag.
A caveat about polyglot versioning
My test harness ran 225 exercises on most models, the same set as Aider's leaderboard. A few runs got 289 or 450 (multiple attempts per exercise from a config tweak); rates are still computed as passed/total. Edit-format matters too: I used whole because it's more robust to weaker models, while Aider's leaderboard uses diff because it gets better scores from the top models. whole is generally a slight handicap. Treat the comparisons as directional, not exact.
Methodology note: all HumanEval+ numbers in this post come from evalplus.codegen, the canonical scorer behind EvalPlus's published leaderboard.
Saturated benchmarks: HumanEval+ and lm_eval
These are the benches where frontier and local models all score in the same 85-95% range, they don't discriminate well anymore. Quick look:
Cross-host gsm8k + ifeval, identical Q4_K_M quantization, identical chat-completions API, 200 items each:
| Model | Strix Halo gsm8k | A30 gsm8k | Strix Halo ifeval | A30 ifeval |
|---|---|---|---|---|
| gemma-3-4b-it | 78.5% | 82.5% | 69.5% | 68.0% |
| qwen3-4b-2507 | 90.0% | 90.0% | 79.5% | 80.5% |
| gemma-3-12b-it | 92.5% | 91.0% | 72.0% | 72.5% |
| gemma-3-27b-it | 93.5% | 93.5% | 77.5% | 76.0% |
| qwen3-30b-a3b-2507 | 95.5% | 93.5% | 79.5% | 80.5% |
| qwen2.5-coder-32b | 95.0% | 95.0% | 75.0% | 75.0% |
| phi-4 (14B) | 90.5% | 91.0% | 56.5% | 57.0% |
| mistral-small-3.2-24b | 95.0% | 94.5% | 76.5% | 72.0% |
| qwen3.6-thinking | 96.5% | n/a (not run on A30) | 79.0% | n/a |
| gpt-oss-20b (reasoning off) | 87.5% | n/a (different run config) | 25.5% | n/a |
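Most cross-host rows agree within 1–2 percentage points. The two outliers are gemma-3-4b-it on gsm8k (78.5% vs 82.5%, 4.0pp) and mistral-small-3.2-24b on ifeval (76.5% vs 72.0%, 4.5pp); on a 200-item subset each score carries a standard error of roughly 3pp, so even those are within noise. Same model, same quantization, same prompt, same score within noise.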
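For a sense of how wide "noise" is on a 200-item subset, a Wilson score interval on a single pass rate is a reasonable sketch. It treats each score as a simple pass count out of 200, which is an assumption (ifeval in particular can be scored at more than one granularity), and the helper name is illustrative.

```python
import math

def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

# mistral-small-3.2-24b ifeval on 200 items: 76.5% (153/200) vs 72.0% (144/200).
for label, passed in [("Strix Halo", 153), ("A30", 144)]:
    lo, hi = wilson_interval(passed, 200)
    print(f"{label}: {passed / 200:.1%}  95% CI {lo:.1%} to {hi:.1%}")
```

The two intervals overlap heavily, so by this measure the widest ifeval gap in the table is not evidence of a cross-host difference.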
Notable observations:
- qwen3.6-thinking tops gsm8k at 96.5%, better than any A30 result in this set, and in the same ~95–97% band frontier models hit on saturated math benches before the frontier moved on to AIME / FrontierMath. On a 35B-A3B MoE running on a miniPC.
- gpt-oss-20b ifeval at 25.5% is shockingly low for a model that hits 87.5% on gsm8k in the same run. This is the --reasoning off configuration. The other gpt-oss-20b runs in my data, reasoning on variants, also fall in the 25–36% ifeval band, so this isn't a reasoning-flag artifact; gpt-oss-20b just struggles with strict prompt-following regardless. Worth knowing if you were planning to deploy it for instruction-bound tasks.
- phi-4 inverted profile: 90.5% gsm8k but only 56.5% ifeval. It's a math-strong, instruction-weaker model. Useful data point for choosing models by use case.
Same story as HumanEval+: local models are basically tied with frontier on saturated benchmarks. A Qwen3-30B-A3B on a miniPC scores 95.5% on gsm8k, comfortably in the same band as any frontier model that's still being measured against gsm8k. The frontier moat still exists, but it's on real-world agentic coding (Polyglot).
Quality at depth: NIAH
Throughput at 65K context is meaningless if the model can't actually find anything at 65K. I tested needle-in-haystack retrieval (single-needle: ask "what's the best thing to do in San Francisco?" with a sandwich-and-Dolores-Park needle planted at 10%, 50%, or 90% depth in a haystack of Paul Graham essays):
| Model | Host | Pass rate (4K/16K/32K/60K × 3 depths) |
|---|---|---|
| qwen3.6-thinking | Strix Halo | 100% (12/12) |
| qwen3-coder-next | Strix Halo | 100% (12/12) |
| qwen3-30b-a3b-2507 | A30 | 100% (12/12) |
| qwen2.5-coder-32b | A30 | 100% (12/12) |
| gemma-3-27b-it | A30 | 100% (12/12) |
| phi-4 | A30 | 100% (12/12) |
| mistral-small-3.2-24b | A30 | 100% (12/12) |
| llama-4-scout-17b-16e | A30 | 100% (12/12) |
| qwen3-4b-2507 | A30 | 100% (12/12) |
| gpt-oss-20b | A30 | 91.7% (11/12) |
| granite-3.1-8b-instruct | A30 | 88.9% (failed at depth) |
| deepseek-coder-v2-lite | A30 | 33.3% (4/12, all 4K passes, 1 of 3 at 16K, every 32K and 60K cell timed out at 600s), same root cause as the llama-bench MLA cliff: CUDA-on-MLA is too slow at depth to finish a single query inside any reasonable budget |
Top-row finding: Strix Halo's qwen3.6-thinking and qwen3-coder-next both score perfect retrieval at 60K context, with response times of 1-2 min per query. The model isn't just running with that context, it's actually using it. Combined with the throughput numbers, this is what makes the miniPC a real coding-agent target rather than a benchmark curiosity.
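For anyone who wants to reproduce the shape of the test, here is a minimal sketch of building a single-needle prompt and scoring the reply. The needle wording is the one commonly used for this test and is assumed here rather than copied from my harness. Depth is measured in characters rather than tokens for simplicity, and the scoring rule is deliberately crude.

```python
NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = "What is the best thing to do in San Francisco?"

def build_niah_prompt(haystack: str, depth: float, needle: str = NEEDLE) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end).

    The insert point is snapped back to the previous sentence end so the
    needle never lands mid-sentence.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    cut = int(len(haystack) * depth)
    boundary = haystack.rfind(". ", 0, cut)
    cut = boundary + 2 if boundary != -1 else 0
    context = haystack[:cut] + needle + " " + haystack[cut:]
    return f"{context}\n\nAnswer using only the text above. {QUESTION}"

def passed(answer: str) -> bool:
    """Crude scoring: the reply has to mention both needle details."""
    text = answer.lower()
    return "sandwich" in text and "dolores park" in text

# Stand-in haystack; the real runs used Paul Graham essays.
haystack = "Filler sentence about startups. " * 200
for depth in (0.1, 0.5, 0.9):
    prompt = build_niah_prompt(haystack, depth)
    print(depth, round(prompt.index(NEEDLE) / len(prompt), 2))
```

Sweeping this over four context lengths and three depths gives the 12 cells per model reported in the table.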
Speed: default throughput
Quality matters; speed matters more than people think. A 62% model running at 1 tok/s is unusable. A 50% model at 80 tok/s is a daily driver.
(Methodology note before the tables: every Strix Halo throughput number below was collected with no other model servers running, fans pinned to max, and free memory verified before each run. There's a bench wrapper now that refuses to start without those conditions met. I ended up writing it after melting the poor machine twice, details in What I'd do differently below.)
Default pp512 / tg128 numbers (Q4_K_M, -fa 1, Strix Halo on q8_0 KV / A30 on q4_0 KV, see longctx section for the protocol note). Throughput is in tokens/sec, so higher is better. The last column is the one place bigger isn't better: it's the A30/miniPC tg ratio, where above 1.0 means the A30 is faster and below 1.0 means the miniPC wins (I flag those rows inline).
| Model | Strix Halo pp | A30 pp | Strix Halo tg | A30 tg | A30/Strix Halo tg ratio |
|---|---|---|---|---|---|
| qwen3-4b-2507 | 2048 | 3934 | 75.0 | 118.9 | 1.59x |
| gemma-3-4b-it | 2257 | 4431 | 74.1 | 109.5 | 1.48x |
| gemma-3-12b-it | 750 | 1590 | 26.9 | 52.4 | 1.95x |
| phi-4 (14B) | 652 | 1452 | 24.1 | 53.2 | 2.21x |
| gpt-oss-20b | 1287 | 2805 | 80.8 | 130.2 | 1.61x |
| mistral-small-3.2-24b | 267 | 905 | 15.3 | 34.5 | 2.26x |
| gemma-3-27b-it | 230 | 771 | 12.6 | 28.0 | 2.23x |
| qwen3-30b-a3b-2507 (MoE) | 1167 | 2274 | 87.0 | 136.2 | 1.56x |
| qwen2.5-coder-32b | 186 | 633 | 11.1 | 24.1 | 2.18x |
| qwen3-coder-next (80B-A3B)† | 551 | 110 | 56.4 | 12.2 | 0.22x ← Strix Halo wins 4.6x |
| qwen3.6 (35B-A3B MoE) | 944 | 1933 | 67.1 | 99.9 | 1.49x |
† The A30 row for qwen3-coder-next is hybrid GPU/CPU offload (22 of 49 layers on GPU, the rest on CPU/RAM). The 45 GiB Q4_K_M model can't fit fully in 24 GiB VRAM, so this is what you get if you force it onto the A30 anyway, the apples-to-apples speed cost of exceeding the VRAM ceiling on a dedicated GPU.
Two stories here:
1. A30 wins at default by roughly 1.5-2.3x on tg (1.9-3.4x on pp). Expected, a dedicated GPU with proper VRAM and CUDA kernels should beat an APU running Vulkan. For the dense models from 12B up, the tg factor is consistent, in the 1.95-2.26x range.
2. MoE narrows the gap and makes the miniPC viable. Look at qwen3-30b-a3b-2507: A30/Strix Halo ratio is just 1.56x for tg, one of the smallest gaps in the table among the bigger models (qwen3.6, another A3B MoE, is smaller still at 1.49x). That's because the model only activates ~3B params per token. Memory bandwidth matters more than raw compute for tg, and Strix Halo's UMA gives it surprisingly good bandwidth for active-parameter-light workloads. (The 4B models also show ratios below 2x, small models stop benefiting from the A30's compute headroom because they're already bandwidth-bound on both boxes.)
Compare that to the dense qwen2.5-coder-32b: 11.1 tok/s on Strix Halo vs 24.1 on A30, still a 2.18x gap but the absolute number is terrible on Strix Halo. I don't know about the rest of you, but 11 tok/s on a 32B dense model is not exactly what I'd call "usable". I'd never reach for the dense coder if a comparable-quality MoE exists.
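The bandwidth argument can be sketched as a simple ceiling: each generated token has to read every active weight once, so tg can't exceed memory bandwidth divided by the bytes of active weights. The bandwidth figures (~256 GB/s for Strix Halo, ~933 GB/s for the A30) are spec-sheet peaks and the ~4.85 bits/weight for Q4_K_M is a rough average; all three are assumptions, not measurements from this post.

```python
def tg_upper_bound(bandwidth_gb_s: float, active_params: float, bits_per_weight: float) -> float:
    """Bandwidth-bound ceiling on tokens/sec.

    Each generated token has to read every active weight once, so the
    ceiling is memory bandwidth divided by bytes of active weights.
    Ignores KV-cache reads, so real numbers land below this.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed figures: ~256 GB/s for Strix Halo and ~933 GB/s for the A30
# (spec-sheet peaks), ~4.85 bits/weight as a rough Q4_K_M average.
BOXES = {"Strix Halo": 256.0, "A30": 933.0}
MODELS = {"32B dense": 32e9, "A3B MoE (~3B active)": 3e9}

for model, active in MODELS.items():
    for box, bandwidth in BOXES.items():
        print(f"{model:>22} on {box:<10}: <= {tg_upper_bound(bandwidth, active, 4.85):6.1f} tok/s")
```

Under those assumptions the dense 32B ceiling on Strix Halo comes out around 13 tok/s, not far above the measured 11.1. That is the sense in which a dense model on this box is pinned to its memory bandwidth, while a ~3B-active MoE has an order of magnitude more room.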
Speed at long context
Now the fun part. Wait, I already said that. Another fun part! Coding agents send long context (the codebase, the test results, previous turns), so what happens when you push the depth?
I ran the same pp512 / tg128 test at depths 0 / 8K / 32K / 65K. Strix Halo is benched with q8_0 KV cache (matches how the production llama-servers are deployed). The A30's previously collected longctx sweep was at q4_0 KV. That small protocol asymmetry is mildly conservative for Strix Halo at depth: q4_0 saves a bit of KV bandwidth at the cost of dequant overhead, which is within MC noise on this hardware, so if anything running q8_0 shaves a few percent off the Strix Halo side at deep contexts.
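As a sketch of what one of these sweeps looks like on the command line, the helper below assembles a llama-bench invocation without running it. The flags are as I understand them from llama-bench's help output (check your build, since this tool changes often). The model path is hypothetical, and reading "8K / 32K / 65K" as 8192 / 32768 / 65536 is an assumption.

```python
import shlex

def llama_bench_argv(model_path: str, depths: list[int], kv_type: str,
                     n_prompt: int = 512, n_gen: int = 128) -> list[str]:
    """Build a llama-bench command line for a depth sweep."""
    return [
        "llama-bench",
        "-m", model_path,
        "-p", str(n_prompt),                # pp512: prompt-processing test size
        "-n", str(n_gen),                   # tg128: generation test size
        "-d", ",".join(map(str, depths)),   # KV depth(s) to fill before measuring
        "-fa", "1",                         # flash attention on
        "-ctk", kv_type,                    # K cache type
        "-ctv", kv_type,                    # V cache type
        "-o", "json",                       # machine-readable output
    ]

DEPTHS = [0, 8192, 32768, 65536]
# Hypothetical model path; substitute your own GGUF file.
argv = llama_bench_argv("models/qwen3-30b-a3b-2507-Q4_K_M.gguf", DEPTHS, "q8_0")
print(shlex.join(argv))
```

Swapping the last argument to q4_0 gives the KV setting used for the A30 sweep.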
gpt-oss-20b (fits both)
| Depth | Strix Halo pp | A30 pp | Strix Halo tg | A30 tg | A30/Strix Halo tg |
|---|---|---|---|---|---|
| 0 (default) | 1287 | 2805 | 80.8 | 130.2 | 1.61x |
| 8K | 958 | 2522 | 66.6 | 109.5 | 1.64x |
| 32K | 547 | 1933 | 56.9 | 77.2 | 1.36x |
| 65K | 338 | 1452 | 45.6 | 54.5 | 1.20x |
A30 tg dropped 58% from default to 65K depth (130 to 55 tok/s). Strix Halo tg dropped 44% over the same range (81 to 46 tok/s). A30 still wins on this model at every depth, but the lead shrinks dramatically as context grows, the A30/Strix Halo ratio compresses from 1.61x at default to 1.20x at 65K.
qwen3-30b-a3b-2507 (fits both)
| Depth | Strix Halo pp | A30 pp | Strix Halo tg | A30 tg | A30/Strix Halo tg |
|---|---|---|---|---|---|
| 0 (default) | 1167 | 2274 | 87.0 | 136.2 | 1.56x |
| 8K | 533 | 1746 | 62.1 | 72.5 | 1.17x |
| 32K | 205 | 1012 | 40.2 | 35.1 | 0.87x ← Strix Halo wins |
| 65K | 110 | 631 | 28.1 | 20.6 | 0.73x ← Strix Halo wins by 36% |
This is where it gets spicy. A30 tg dropped 85% from default to 65K (136 to 21 tok/s), the 24 GiB VRAM ran out of room for a meaningful KV cache at depth. Strix Halo tg dropped 68% over the same range (87 to 28 tok/s), painful but consistent. Crossover happens between 8K and 32K context. At 32K the miniPC is already faster; at 65K it's 36% faster than the dedicated GPU.
The model itself is 17 GiB at Q4_K_M. The A30 has 24 GiB of VRAM. At 65K context the KV cache and activations are competing for the remaining 7 GiB of headroom, and CUDA's memory management gets bottlenecked. Strix Halo's 128 GiB UMA doesn't care: there's so much memory headroom that the only constraints are compute and bandwidth, both of which degrade gracefully.
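To see how much of that headroom a KV cache takes, the standard size formula is easy to sketch: one K and one V vector per layer, per KV head, per token. The layer count, KV-head count, and head dimension below are placeholders, not values read from any model file. Plug in the real ones from the GGUF metadata and the KV type actually used before drawing conclusions.

```python
# Bytes per cached element for llama.cpp KV cache types:
# q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, kv_type: str = "f16") -> float:
    """Size of a standard (non-MLA) KV cache: one K and one V vector
    per layer, per KV head, per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return elements * BYTES_PER_ELEMENT[kv_type]

# Placeholder architecture numbers, NOT read from any model file:
# 48 layers, 4 KV heads, head dimension 128.
for depth in (8192, 32768, 65536):
    for kv_type in ("f16", "q8_0", "q4_0"):
        gib = kv_cache_bytes(48, 4, 128, depth, kv_type) / 2**30
        print(f"depth {depth:>6}  {kv_type:<5} {gib:5.2f} GiB")
```

The cache grows linearly with depth, and the KV type changes its size by several times, so the same model can sit comfortably in VRAM at one setting and crowd it at another.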
qwen3.6-thinking (Strix Halo only, Sonnet-tier model)
This is the model I'd actually use for coding. The numbers are remarkable:
| Depth | Strix Halo pp | Strix Halo tg |
|---|---|---|
| 0 (default) | 944 | 67.1 |
| 8K | 790 | 61.8 |
| 32K | 517 | 55.6 |
| 65K | 349 | 45.5 |
tg drops 32% from default to 65K depth (67.1 to 45.5 tok/s). A Sonnet-class model running locally at 45 tok/s with a 65K-token context window. That's actually usable for serious agentic coding, you can pack a meaningful chunk of a codebase into the context and not pay a brutal speed tax for it.
A note on Q8_0: I also ran the no-think qwen3.6 at Q8_0 (38 GiB on disk vs Q4_K_M's 20 GiB). Polyglot moved from 53.8% to 56.9%, a ~3 pp gain. Throughput dropped from 65 tok/s to 50 tok/s at default and is similarly proportional at depth. So if you have the disk and want every last point of Polyglot, Q8_0 is a real upgrade. If you'd rather have the speed, Q4_K_M is the right call, the quality gap is small relative to the speed cost.
qwen3-coder-next 80B-A3B (Strix Halo UMA vs A30 hybrid offload)
The 80B-A3B that motivated the last post. At 45 GiB Q4_K_M it doesn't fit in 24 GiB VRAM, so the A30 column here is hybrid GPU/CPU offload (-ngl 22, 22 of 49 layers on GPU, the rest streamed from system RAM). Strix Halo's 128 GiB UMA swallows the full model and runs entirely on the iGPU:
| Depth | Strix Halo pp | A30 hybrid pp | Strix Halo tg | A30 hybrid tg | A30/Strix Halo tg |
|---|---|---|---|---|---|
| 0 (default) | 551 | 110 | 56.4 | 12.2 | 0.22x ← Strix Halo wins 4.6x |
| 8K | 500 | 109 | 52.5 | 9.3 | 0.18x ← Strix Halo wins 5.6x |
| 32K | 372 | 109 | 46.9 | 5.4 | 0.12x ← Strix Halo wins 8.7x |
| 65K | 256 | 106 | 38.1 | 3.9 | 0.10x ← Strix Halo wins 9.8x |
This is the clearest "wrong tool for the job" result I had. The A30 is a good card, it just doesn't have enough VRAM to hold the model, and PCIe bandwidth between GPU and host RAM is roughly 30x slower than the A30's own HBM2. So every token has to drag activations across that bottleneck.
The math: A30 hybrid tg falls from 12.2 to 3.9 tok/s (a 68% drop) over the depth sweep, while Strix Halo's UMA tg falls from 56.4 to 38.1 (only 32%). The A30 falls off twice as steeply because attention has to read the full KV cache to produce each new token, and on hybrid mode roughly half the model's layers, plus their slice of the KV cache, live in CPU RAM (DDR4, on this server). Each token's attention op pays PCIe-bandwidth overhead, and that overhead scales with context length. So 4.6× at default and 9.8× at 65K.
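The arithmetic behind those percentages and ratios, as a small sketch using the tg numbers from the table above (the dictionary and function names are just for illustration):

```python
# tg (tokens/sec) by depth for qwen3-coder-next, copied from the table above.
STRIX_TG = {"0": 56.4, "8K": 52.5, "32K": 46.9, "65K": 38.1}
A30_HYBRID_TG = {"0": 12.2, "8K": 9.3, "32K": 5.4, "65K": 3.9}

def pct_drop(start: float, end: float) -> float:
    """Percentage lost going from start to end."""
    return (start - end) / start * 100

for name, tg in (("Strix Halo", STRIX_TG), ("A30 hybrid", A30_HYBRID_TG)):
    print(f"{name}: {pct_drop(tg['0'], tg['65K']):.0f}% drop from default to 65K")

for depth in STRIX_TG:
    ratio = STRIX_TG[depth] / A30_HYBRID_TG[depth]
    print(f"depth {depth:>3}: Strix Halo is {ratio:.1f}x faster")
```

The ratio widens at every step of the sweep, which is what you'd expect if the per-token cost of reaching into host RAM grows with the amount of KV cache living there.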
On the Strix system the story is the other way around: the iGPU has the same bandwidth to all 128 GiB as it does to the first 16 GiB. There's no VRAM cliff to fall off because there's no VRAM/RAM distinction at all. tg drops 32% from default to 65K (56.4 to 38.1 tok/s), painful but consistent, and at 38 tok/s with 65K of context loaded it's still... not fast, but usable.
(I also tried to run Aider Polyglot on A30 hybrid for a quality cross-check; the harness's per-call timeout repeatedly fired against the 3.9–9.3 tok/s hybrid response rate, and I abandoned the run after 9 of 225 exercises in ~5 hours. Throughput data above is from llama-bench directly, which doesn't have that problem.)
DeepSeek-Coder-V2-Lite, the bonus weird result
I benchmarked this one for completeness, expecting nothing exciting. Instead I found one of the clearest "the dedicated GPU is broken here" results in the whole sweep. DeepSeek-V2's Multi-head Latent Attention (MLA) uses a low-rank-projected KV cache that's smaller than standard MHA but requires a different attention kernel. The CUDA implementation in llama.cpp build 9064 falls off a cliff once any KV is present:
| Depth | Strix Halo pp (Vulkan) | A30 pp (CUDA) | Strix Halo tg | A30 tg | A30/Strix Halo tg |
|---|---|---|---|---|---|
| 0 (default) | 1641 | 408 | 106.0 | 88.5 | 0.83x ← Strix Halo wins |
| 8K | 1032 | 17 | 64.4 | 4.9 | 0.08x ← Strix Halo wins 13x |
| 32K | 484 | wedged | 30.8 | wedged | n/a |
| 65K | 250 | wedged | 17.4 | wedged | n/a |
The A30 bench actually wedged my harness, at d=32K, the CUDA kernel grinds at ~3-5 tok/s prefill, which means a single measurement of the 32K-token prefill would take 100+ minutes. I killed it after 17 minutes of no progress.
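The "100+ minutes" figure is plain division, sketched here. It takes 32K to mean 32,768 tokens, which is an assumption about the exact depth, and it assumes the prefill rate stays flat for the whole prompt.

```python
def prefill_minutes(n_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock minutes to ingest n_tokens at a steady prefill rate."""
    return n_tokens / tokens_per_sec / 60

# 32K taken as 32,768 tokens (an assumption about the exact depth),
# at the ~3-5 tok/s prefill rate observed before the run was killed.
for rate in (3.0, 5.0):
    print(f"{rate} tok/s -> {prefill_minutes(32768, rate):.0f} min")
```

Even the optimistic end of that range is well over an hour and a half for one measurement, and llama-bench repeats each test several times by default.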
Strix Halo's Vulkan path handles MLA at depth normally, degrading from 106 to 17 tok/s tg is a real cliff, but it's a finite one and the bench actually finishes. Even at d=0 Strix Halo is 4× faster on pp512 (1641 vs 408), and that's before any KV is in play. The CUDA backend isn't just slow at depth on this architecture, it's just slow on this architecture.
This isn't a hardware issue, I don't think, it's a software bug. Presumably to be fixed in some future llama.cpp release lol. But for anyone considering DeepSeek-V2-family models for coding right now the miniPC is the only sensible target. A 24 GiB A30 will load the model just fine and then be fairly unusable.
Bonus models (Strix Halo coverage only)
For completeness, two more models I benchmarked on Strix Halo to fill out the table:
| Model | Default pp | Default tg | 65K pp | 65K tg | Note |
|---|---|---|---|---|---|
| granite-3.1-8b-instruct | 996 | 39.5 | (crashed) | (crashed) | Vulkan device-lost at d=65K, got d=0/8K/32K only |
| llama-4-scout-17b-16e | 159 | 20.1 | 105 | 13.9 | 17B-active, 109B-total, slowest in the post but flattest depth scaling (only 31% tg drop) |
What each box is actually best for
- Strix Halo as a coding agent: qwen3.6 with thinking on when I want quality, qwen3.6 with thinking off when I want speed/quality balance. Same model file, same throughput, just flip the --reasoning flag.
- A30 for serving small concurrent requests: gpt-oss-20b at 130 tok/s or qwen3-30b-a3b at 136 tok/s is great for embeddings, rerank, and utility models in a stack.
These are different jobs. The boxes aren't substitutes; they're complements.
What I'd do differently
- Update everything to the latest first. I spent a week chasing scores that looked too low only to realize my llama.cpp was 700 commits behind on reasoning-channel handling. Thinking models scored 0% on lm_eval because the reasoning content was consuming the entire context budget. A rebuild fixed it. This stuff moves fast, llama.cpp lands fixes weekly, so pull and rebuild to the latest before you trust a single number.
- Bench with -d from the start, not -c. The -c arg got removed from llama-bench in recent builds; the replacement is -d for testing tg at a given KV depth. My first A30 long-context sweep died at parse time. Trivial fix in retrospect, but it cost me half a day.
- Don't trust HumanEval+ as a discriminator. Everything competent scores 85%+. The bench doesn't separate "okay" from "great." Polyglot is what actually matters; I should have run it first.
- Run whole and diff edit formats both. I ran everything in whole because it's robust for weak models. That makes the strong-model comparisons against Aider's leaderboard (which uses diff) slightly unfair to the local models. Doing both would have given a cleaner local-vs-API comparison.
- Treat thermals and bench cleanliness as first-class concerns. Two specific traps cost me roughly a week of redo work:
  - Don't re-make the same thermal mistakes as last time. I already worked this box's thermals out in the last post: sustained GPU load trips it unless you cap power with ryzenadj and pin the fans manually, because the stock fan curve is tuned for desktop bursts, not back-to-back benchmarks holding the GPU near 100% for minutes at a time. Then I forgot to actually turn any of that on before kicking off a multi-hour sweep, and crashed the box twice (no kernel log, just unreachable until a power-cycle) rediscovering a lesson I'd already written down. The fix was the one I already had on the shelf: mode=fixed level=5 on all three fans (under /sys/class/ec_su_axb35/fan*/) before any sustained workload. The wrapper now refuses to start a bench unless the fans are confirmed above 3500 RPM.
  - Keep other model servers cleared out the whole time, not just at the start. Any concurrent llama-server process --mlock's its model into RAM and steals memory bandwidth from the bench. I caught this when a spot-check tg128 re-run came in 5% higher than the recorded number with everything else stopped. Five percent is small enough to miss in a single run and big enough to materially change rankings across models. The real trap is that it's easy to start clean and then let stray servers creep back in over a long session, so the fix isn't a one-time cleanup, it's re-verifying nothing else is loaded before every single run. Every Strix Halo throughput number in this post was collected that way, and the wrapper enforces it as a precondition.
The meta lesson: a bench harness that requires you to remember the discipline will eventually run dirty. Make the harness refuse to run unless the conditions are met.
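A minimal sketch of that "refuse to run" idea, not the actual wrapper: the checks are a pure function over readings you collect however suits your box, so the logic can be tested without touching hardware. The class and function names are illustrative, the 100 GiB free-memory floor is an assumed threshold, and how you read fan RPM depends on your fan driver.

```python
from dataclasses import dataclass

@dataclass
class BenchState:
    fan_rpms: list[int]          # current RPM of each fan
    server_pids: list[int]       # PIDs of any running llama-server processes
    mem_available_gib: float     # e.g. MemAvailable from /proc/meminfo

def preflight_failures(state: BenchState, min_rpm: int = 3500,
                       min_free_gib: float = 100.0) -> list[str]:
    """Return the reasons a bench must not start; empty means go."""
    failures = []
    slow = [rpm for rpm in state.fan_rpms if rpm <= min_rpm]
    if not state.fan_rpms or slow:
        failures.append(f"fans not confirmed above {min_rpm} RPM: {state.fan_rpms}")
    if state.server_pids:
        failures.append(f"llama-server still running: PIDs {state.server_pids}")
    if state.mem_available_gib < min_free_gib:
        failures.append(f"only {state.mem_available_gib:.1f} GiB available, need {min_free_gib}")
    return failures

def read_mem_available_gib(meminfo_text: str) -> float:
    """Parse MemAvailable (reported in kB) out of /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 2**20
    raise ValueError("MemAvailable not found")

# Example with made-up readings: one fan is too slow and a server is up.
state = BenchState(fan_rpms=[4100, 3900, 2800], server_pids=[4242], mem_available_gib=110.0)
for reason in preflight_failures(state):
    print("REFUSING TO RUN:", reason)
```

The important design choice is calling this before every run rather than once per session, since the failure mode described above is conditions drifting after a clean start.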
The end result
Use the right tool for the job. Shocking, I know.
The miniPC can be a Sonnet-tier coding agent (when running the right model) that costs about $2,500 once and never sends my code anywhere. The A30 box is for smaller task-specific models that need high throughput.
The local-vs-frontier gap is still real on the hardest problems and on real agentic Polyglot work, but it's roughly Sonnet-class for daily-driver coding tasks, and the gap is closing. The next time someone benchmarks this, I expect the frontier-API moat to be at least a little bit smaller.
*Footnote: the reasoning_effort parameter not wiring through to llama.cpp's gpt-oss path is documented elsewhere; I verified by running effort=low/medium/high through lm_eval gsm8k and getting near-identical scores (90% / 86% / 86%), within the sampling noise band for a 200-item subset. If the flag were actually doing anything, I'd expect monotonic improvement from low to high; instead "high" is the same as "medium" and "low" comes out higher than both, which only makes sense if all three are effectively the same configuration plus sampling noise. A separate post about this might be coming.*