<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Damen Knight</title>
    <link>https://damenknight.com/</link>
    <description>Astrophotography, Tech Stuff, More</description>
    <language>en-us</language>
    <copyright>© 2026 Damen Knight. All rights reserved.</copyright>
    <atom:link href="https://damenknight.com/feed.xml" rel="self" type="application/rss+xml" />
    <lastBuildDate>Sat, 30 May 2026 12:00:00 GMT</lastBuildDate>
    <item>
      <title>MTP Speculative Decoding on Strix Halo: How I Made It 3x Slower Before I Made It Faster</title>
      <link>https://damenknight.com/mtp-speculative-decoding-strix-halo/</link>
      <guid isPermaLink="true">https://damenknight.com/mtp-speculative-decoding-strix-halo/</guid>
      <pubDate>Sat, 30 May 2026 12:00:00 GMT</pubDate>
      <category>Homelab</category>
      <category>Projects</category>
      <description>A short follow-up on the Strix Halo miniPC: turning on MTP speculative decoding with its default settings made qwen3.6 generation 3x slower. Tuned for an A3B…</description>
      <content:encoded><![CDATA[<p>In <a href="https://damenknight.com/strix-halo-vs-a30-vs-frontier/">the last post</a> I landed on qwen3.6 as the most usable coding model I could actually run on the miniPC (a Strix Halo box: AMD Ryzen AI MAX+ 395, 128 GiB of unified memory, Vulkan). This is a MUCH shorter follow-up about squeezing more tokens/sec out of it with MTP, and about the ditch I drove into on the way.</p>

<p>The three-sentence version: turned on with its default settings, MTP made generation <em>3x slower</em>! Tuned for this model, it’s about 18–26% faster. The difference between the two is a single number.</p>

<div class="toc" style="background-color: var(--color-bg-raised); border: 1px solid var(--color-border); border-left: 3px solid var(--color-accent); border-radius: 10px; padding: 1.5rem 2rem; margin-bottom: 2.5rem;">
<p style="font-family: var(--font-mono); font-size: 0.8125rem; font-weight: 600; color: var(--color-accent); text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 0.75rem;">Contents</p>
<ul style="list-style: none; padding: 0; margin: 0;">
<li style="margin-bottom: 0.35rem; "><a href="#what-is-mtp" style="font-size: 0.9375rem; text-decoration: none;">What the heck is MTP??</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#the-wrong-path" style="font-size: 0.9375rem; text-decoration: none;">The wrong path</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#the-knob-that-matters" style="font-size: 0.9375rem; text-decoration: none;">The knob that matters: --spec-draft-n-max</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#the-config-id-use" style="font-size: 0.9375rem; text-decoration: none;">The config I'd actually use</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#the-iq4-xs-build" style="font-size: 0.9375rem; text-decoration: none;">What about the faster IQ4_XS build?</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#strix-halo-vulkan-result" style="font-size: 0.9375rem; text-decoration: none;">This is a Strix Halo / Vulkan result</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#the-takeaway" style="font-size: 0.9375rem; text-decoration: none;">The takeaway</a></li>
</ul>
</div>

<h2 id="what-is-mtp">What the heck is MTP??</h2>

<p>Normally a model generates one token per forward pass: run the whole network, get one token, repeat. That’s slow. Speculative decoding speeds it up by first having a small, fast predictor — something far cheaper to run than the full model — guess the next several tokens, and then letting the full model verify that whole batch of guesses in a <em>single</em> forward pass. The trick is that checking several tokens at once costs the big model about the same as generating one. Every guess it accepts is a token you got essentially for free. The catch, of course, is that every <em>rejected</em> guess is wasted compute, both the draft’s and the verify’s. The math only pays off if the guesses are accepted often.</p>
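<p>To make the accept/reject step concrete, here is a minimal sketch of greedy verification. It is a toy under stated assumptions, not how llama.cpp implements it: the function name <code>verify_draft</code> and the word-level "tokens" are made up for illustration, and real implementations compare against the full model’s logits and sampling settings rather than plain strings.</p>

```python
def verify_draft(draft, target):
    """Greedy acceptance: keep drafted tokens while they match the full model.

    draft  -- tokens guessed by the cheap predictor
    target -- the full model's own pick at each of those positions, plus one
              extra pick for the position after the last draft token
    """
    accepted = []
    for guess, truth in zip(draft, target):
        if guess != truth:
            # First mismatch: keep the full model's token and stop.
            accepted.append(truth)
            return accepted
        accepted.append(guess)
    # Every guess matched, so the extra token from the verify pass is a bonus.
    accepted.append(target[len(draft)])
    return accepted


# One rejected guess: two drafted tokens survive, plus the correction.
print(verify_draft(["the", "cat", "sat"], ["the", "cat", "ran", "off"]))
# All guesses accepted: three drafted tokens plus one bonus token.
print(verify_draft(["the", "cat", "sat"], ["the", "cat", "sat", "down"]))
```

<p>In both cases the full model ran once. The only difference is how many of the drafted tokens it kept, which is why the acceptance rate decides whether drafting pays off.</p>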

<p>MTP (Multi-Token Prediction) is the self-speculative version: instead of running a second small “draft” model alongside the big one, the model ships with an extra lightweight head trained to predict a few tokens ahead. The draft and the verify come from the same model. llama.cpp added support in <a href="https://github.com/ggml-org/llama.cpp/pull/22673">PR #22673</a> (merged 2026-05-16, build b9180 or later), exposed as <code>--spec-type draft-mtp</code>.</p>

<p>Qwen3.6 here is a 35B-A3B Mixture-of-Experts model: 35B total parameters, but only ~3B are active per token. That “A3B” part turns out to matter a lot for whether MTP helps.</p>

<h2 id="the-wrong-path">The wrong path</h2>

<p>The first thing I did, which was the first thing and not the smart thing, was the naive run: flip MTP on, leave everything at defaults, and see what stock settings buy you. It’s what most people will reach for, and the articles all quote ~2x, so why not? Generation dropped to 18.6 tok/s, down from ~60 with MTP off. So, the opposite of 2x. Definitely didn’t seem right lol.</p>

<p>The culprit was <code>--spec-draft-n-max</code>, the number of tokens the head is allowed to draft ahead before the model checks its work. It defaults to <em>16</em>. Here’s what it gets you at a range of values (tok/s, higher is better):</p>

<table>
<thead>
<tr>
<th>MTP config (qwen3.6-35B-A3B, Unsloth Q4_K_M — early run)</th>
<th>tok/s</th>
</tr>
</thead>
<tbody>
<tr>
<td>off (baseline)</td>
<td>59.7</td>
</tr>
<tr>
<td>on, <code>--spec-draft-n-max 16</code> (the default)</td>
<td>18.6</td>
</tr>
<tr>
<td>on, <code>--spec-draft-n-max 8</code></td>
<td>27.6</td>
</tr>
<tr>
<td>on, <code>--spec-draft-n-max 4</code></td>
<td>66.9</td>
</tr>
</tbody>
</table>

<p>A caveat on these numbers: they were quick and exploratory. I killed off any stray model servers but skipped the full services-down, fans-pinned protocol I used for the recommended config below, so read the absolute baseline loosely. (The baseline also sits at ~60 here rather than the ~65 you’ll see later, because this early run used a slightly slower quant upload. Why two “Q4_K_M” files run at different speeds is its own rabbit hole, maybe a future companion post.) A 3x regression dwarfs either effect, which is the whole point.</p>

<p>So there we have it: for me, the default was a roughly 3x regression, and the line between useless and useful is narrow. n=8 is still slower than no MTP at all, and only by n=4 does it pull ahead. I almost wrote the whole thing up as “MTP doesn’t work on Strix Halo MoE” right there. Then I did the thing I should’ve led with if I’d actually wanted good numbers: went and read how MTP works.</p>
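<p>For anyone who wants the ratios rather than the raw tok/s, this small sketch turns the table into multiples of the MTP-off baseline. The numbers are copied from the table; the helper name <code>relative_speed</code> is just for illustration.</p>

```python
baseline = 59.7  # tok/s with MTP off, from the table above

# tok/s per --spec-draft-n-max value, copied from the same table
measured = {16: 18.6, 8: 27.6, 4: 66.9}


def relative_speed(tok_s, base=baseline):
    """Throughput as a multiple of the MTP-off baseline."""
    return tok_s / base


for n_max, tok_s in measured.items():
    print(f"n_max={n_max:2d}: {relative_speed(tok_s):.2f}x baseline")
```

<p>That works out to roughly 0.31x at the default, 0.46x at n=8, and 1.12x at n=4.</p>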

<h2 id="the-knob-that-matters">The knob that matters: --spec-draft-n-max</h2>

<p>The default of 16 is calibrated for dense, instruction-tuned models, which accept long draft runs. An A3B MoE is the opposite: only ~3B parameters fire per token, the head’s predictions get rejected sooner, and every rejected draft past the acceptance point is pure waste. The community guidance for A3B-class MoEs converges on <em>n=2 or 3</em>. Even llama.cpp’s own MTP pull request reports its best results around 3 draft tokens at roughly 75% steady-state acceptance, nowhere near the default 16, and acceptance only falls faster as you push past that.</p>
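<p>A back-of-the-envelope way to see why long drafts stop paying: if you assume every drafted token is accepted independently with the same probability (a simplification, and an optimistic one, since acceptance reportedly falls off as the draft gets longer), the expected number of tokens per verify pass is a short geometric sum. The sketch below plugs in the ~75% figure quoted above; the function name is made up for illustration.</p>

```python
def expected_tokens(accept_rate, n_max):
    """Expected tokens emitted per verify pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate, and that the pass always yields one token from the full model.
    """
    return sum(accept_rate ** k for k in range(n_max + 1))


for n_max in (1, 2, 3, 4, 8, 16):
    print(f"n_max={n_max:2d}: {expected_tokens(0.75, n_max):.2f} tokens per pass")
```

<p>Under these assumptions the expectation climbs from about 2.7 tokens at n=3 to just under 4 at n=16, so the 13 extra drafted tokens buy roughly one more accepted token per pass while every one of them still has to be drafted and verified.</p>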

<p>The other thing worth saying plainly: the reported 2x speedup you may have seen for “Qwen3.6 + MTP” comes from other setups (the PR author quotes >2x, testing on a different stack than a power-capped Strix Halo). On this box, for the 35B-A3B MoE on Vulkan, the realistic ceiling I measured is more like 1.2x. Lower acceptance means less free speed. That’s not a failure, it’s just the honest number for this class of model on this hardware, and it’s still worth having.</p>

<p>There’s also a second knob, <code>--spec-draft-p-min</code> (the minimum probability the head needs before a draft is even attempted, default 0.75). Some guides call it the most impactful parameter. On my hardware, sweeping it from 0.5 to 0.9 stayed within sampling noise, so I left it at the default. Your mileage may vary; it’s worth a quick sweep, but n_max is the one that actually moved my numbers.</p>

<h2 id="the-config-id-use">The config I'd actually use</h2>

<p>Two setups: pick by whether you care more about disk or about the last drop of quality. Both are gfx1151 on Vulkan (see the caveat at the end if you’re on something else), same clean-bench methodology as the last post.</p>

<p><strong>The build.</strong> You need llama.cpp with <a href="https://github.com/ggml-org/llama.cpp/pull/22673">PR #22673</a>, so b9180 or later. Check with <code>llama-server --help</code> and look for <code>draft-mtp</code> in the <code>--spec-type</code> modes. Build it Vulkan-only: there’s a known foot-gun (<a href="https://github.com/ggml-org/llama.cpp/issues/23199">issue #23199</a>) where, in a dual Vulkan+ROCm build, the MTP tensors get placed on the ROCm device and MTP is silently disabled even when you asked for Vulkan. I didn’t run into this thankfully (my build is Vulkan-only), but watch out.</p>

<p><strong>Q4_K_M (smaller, faster, what I’d default to).</strong> Two files from <a href="https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF">bartowski/Qwen_Qwen3.6-35B-A3B-GGUF</a>: the model (<code>Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf</code>, ~20 GiB) and the MTP head as a separate file (<code>mtp-Qwen_Qwen3.6-35B-A3B-Q4_0.gguf</code>, ~1 GiB).</p>

<pre><code>llama-server \
  --model Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --model-draft mtp-Qwen_Qwen3.6-35B-A3B-Q4_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 -c 8192 -fa on --parallel 1 \
  -t 32 -tb 32 -ub 2048 \
  -ctk q8_0 -ctv q8_0 \
  --reasoning off \
  --host 0.0.0.0 --port 8089 --alias qwen3.6</code></pre>

<p>That gets ~77 tok/s, versus ~65 with the <code>--model-draft</code> and <code>--spec-type</code> lines removed. About +18%.</p>

<p><strong>Q8_0 (more disk, a hair more quality).</strong> Same flags, the Q8_0 trunk and head (<code>Qwen_Qwen3.6-35B-A3B-Q8_0.gguf</code> ~35 GiB, <code>mtp-Qwen_Qwen3.6-35B-A3B-Q8_0.gguf</code> ~2 GiB), and one important change: <code>--spec-draft-n-max 2</code> instead of 3. Expected: ~63 tok/s, versus ~50 with MTP off. About +26%.</p>

<p>The sweet spot drops to 2 at Q8 because the trunk’s own predictions are more confident, so it accepts fewer drafts from the head, and drafting a third token just wastes work. Sweep n_max (1, 2, 3, 4) on whatever model and quant combo you land on and use whichever wins.</p>
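<p>If you want to script that sweep, a minimal sketch might look like the following. It only builds the command lines and picks a winner from numbers you supply; it does not launch the server or measure anything. The flags are the ones from the command above (with the tuning flags left out for brevity), and the helper names and the readings passed to <code>best_n_max</code> are made up for illustration.</p>

```python
BASE_FLAGS = [
    "llama-server",
    "--model", "Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf",
    "--model-draft", "mtp-Qwen_Qwen3.6-35B-A3B-Q4_0.gguf",
    "--spec-type", "draft-mtp",
]


def sweep_commands(n_values=(1, 2, 3, 4)):
    """Return one llama-server argv list per --spec-draft-n-max value."""
    return {n: BASE_FLAGS + ["--spec-draft-n-max", str(n)] for n in n_values}


def best_n_max(tok_per_sec):
    """Pick the n_max with the highest measured tok/s."""
    return max(tok_per_sec, key=tok_per_sec.get)


for n, argv in sweep_commands().items():
    print(n, " ".join(argv))

# Made-up readings, only to show the selection step; substitute your own.
print(best_n_max({1: 10.0, 2: 12.0, 3: 11.0}))
```

<p>Run each command, benchmark it the same way every time, and feed the measured tok/s back in.</p>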

<p><strong>The head quant matters less than you’d think, and lighter is better.</strong> A Q4_0 head on the Q4_K_M trunk hit 77 tok/s; a Q8_0 head on the same trunk did 74. The bigger head’s better predictions don’t pay for the extra bandwidth they cost.</p>

<p><strong>Configs that lose, for completeness:</strong> the default <code>--spec-draft-n-max 16</code> (~19 tok/s, a 3x regression) and <code>--spec-draft-n-max 8</code> (~28 tok/s, still slower than no MTP). The default is the trap.</p>

<p>If you’d rather not juggle two files, <a href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF">unsloth/Qwen3.6-35B-A3B-MTP-GGUF</a> bundles the trunk and head into one (drop <code>--model-draft</code>, keep <code>--spec-type draft-mtp</code>). I measured ~75 tok/s at n=3, about 2 tok/s behind the bartowski split, with identical quality on Polyglot, gsm8k, and ifeval. Use whichever workflow you prefer.</p>

<h2 id="the-iq4-xs-build">What about the faster IQ4_XS build?</h2>

<p>There’s an <a href="https://huggingface.co/localweights/Qwen3.6-35B-A3B-MTP-IQ4_XS-GGUF">IQ4_XS build with the MTP head baked in</a>, and a pretty wild-sounding claim for IQ4_XS+MTP on Strix Halo was making the rounds: 90.8 tok/s average, 110.6 peak. IQ4_XS is a smaller quant than Q4_K_M (about 4.25 bits per weight versus ~4.8), and since generation is bandwidth-bound, smaller CAN mean faster, so it’s a plausible claim. I downloaded it and benched it the same way as everything above, but I was not able to reproduce it.</p>

<table>
<thead>
<tr>
<th>IQ4_XS + MTP (n=2), my box</th>
<th>tok/s</th>
</tr>
</thead>
<tbody>
<tr>
<td>q8_0 KV cache</td>
<td>~78</td>
</tr>
<tr>
<td>f16 KV cache</td>
<td>~81</td>
</tr>
<tr>
<td>the number I was chasing</td>
<td>90.8 avg / 110.6 peak</td>
</tr>
</tbody>
</table>

<p>Apples-to-apples, at the q8_0 KV cache my recommended config uses, IQ4_XS lands around 78 tok/s, a hair over the Q4_K_M + MTP setup above (~77) but inside the noise. Switching IQ4_XS to an f16 KV cache pushes it to ~81. That is a real few percent, but it is the cache talking, not the quant: f16 would lift the Q4_K_M numbers the same way. So IQ4_XS earns you a little (a smaller file), and a little more if you spend the extra memory on an f16 cache, but it is not the different league the 90+ figure implies. A raw <code>llama-bench</code> pass on the file came in at 73 tok/s, so that is not where the headline comes from either.</p>
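<p>As a sanity check on the “smaller quant, faster generation” argument, here is a rough ceiling calculation. It assumes generation speed scales inversely with bits per weight, which is only true if reading the weights is the sole cost. KV cache reads, the MTP head, and compute overhead do not shrink with the quant, so real gains should land below this. The bits-per-weight values are the approximate ones quoted above.</p>

```python
def bandwidth_bound_speedup(bpw_from, bpw_to):
    """Upper-bound speedup from a smaller quant if weight reads dominate."""
    return bpw_from / bpw_to


ratio = bandwidth_bound_speedup(4.8, 4.25)  # Q4_K_M to IQ4_XS, approximate bpw
print(f"ceiling: {ratio:.2f}x")
print(f"Q4_K_M + MTP at ~77 tok/s would top out near {77 * ratio:.0f} tok/s")
```

<p>Even this idealized ceiling comes out around 1.13x, or roughly 87 tok/s from a 77 tok/s starting point, which is still short of the 90.8 average.</p>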

<p>My best guess for the gap to 90.8: I cap this box at 100W with <code>ryzenadj</code> for round-the-clock thermal stability (the “I melted it twice” saga from the last post). Run the chip hotter, or on a newer build, and you would probably claw some of it back. Worth a shot if you have the thermal headroom. But on a power-limited Strix Halo it is a modest step over the Q4_K_M + MTP config, not a leap, and nothing I measured got close to 90. There are some thermal upgrades I may attempt to make, and maybe I’ll revisit this with a higher cap if I do.</p>

<h2 id="strix-halo-vulkan-result">This is a Strix Halo / Vulkan result</h2>

<p>The numbers above are gfx1151 on Vulkan. On CUDA, an early community writeup for this same model found no net speedup from llama.cpp’s speculative-decoding paths: it tested <a href="https://github.com/thc1006/qwen3.6-speculative-decoding-rtx3090">19 configurations on an RTX 3090</a> and found none faster than baseline (the same author’s <a href="https://hackmd.io/ODXuOQNzSiyUITz7g9mtBw">HackMD notes</a> lay out the detail). Note that’s a llama.cpp-specific result: the same author found vLLM’s MTP faster on the same card. So the speedup here may be specific to the Vulkan path or to the unified-memory architecture. If you’re on a 3090 or an A100, don’t expect these numbers, and if you measure your own, publish them: there’s a bit of public A3B-MoE spec-decode data now (mostly CUDA, plus a Strix Halo ROCm run or two), but I couldn’t find a single A3B-on-gfx1151-via-Vulkan benchmark out there, so that corner is wide open.</p>

<h2 id="the-takeaway">The takeaway</h2>

<p>MTP on this hardware is a real but modest gain: ~18–26%, not the 2x you’ll see quoted for the dense model. And the default config is actively harmful on an A3B MoE. If you take one thing from this: turn <code>--spec-draft-n-max</code> down to 2 or 3 before you decide whether MTP works for you.</p>]]></content:encoded>
    </item>
    <item>
      <title>Strix Halo vs an A30 vs the Frontier: What the miniPC Can (and Can&#x27;t) Actually Do</title>
      <link>https://damenknight.com/strix-halo-vs-a30-vs-frontier/</link>
      <guid isPermaLink="true">https://damenknight.com/strix-halo-vs-a30-vs-frontier/</guid>
      <pubDate>Sat, 23 May 2026 12:00:00 GMT</pubDate>
      <category>Homelab</category>
      <category>Projects</category>
      <description>I benchmarked my Strix Halo miniPC against an A30 GPU and frontier API models. The short version: locally-hosted qwen3.6-thinking scores 62.2% on the Aider…</description>
      <content:encoded><![CDATA[<p>A few months ago <a href="https://damenknight.com/running-frontier-coding-model-mini-pc/">I wrote about getting a Strix Halo miniPC (~$2.5K all-in) to run a frontier coding model</a>. That led to people asking me how I was measuring/testing models (I really wasn't, beyond what my anecdotal experience was).  That led to me starting to actually benchmark stuff - but then I thought... why not throw my A30 GPU in the mix while I'm at it, and really see what's what?  They even cost me (roughly) the same.</p>

<p>The short version: I didn't lie! It is roughly like Sonnet when it's a bit slow.  BUT, things have changed and now you should use qwen3.6.  Also there are some other neat findings ahead ;)</p>

<p>The longer version: quality is hardware-agnostic, which is exactly what you'd expect (run the same model on a Strix Halo APU or a datacenter A30 and you get the same benchmark score within noise, because the hardware should decide how fast, not how smart). The interesting questions were always about speed and fit, and that is where the miniPC shines!</p>

<p>Locally-hosted <code>qwen3.6-thinking</code> (a 35B-A3B Mixture-of-Experts (MoE) model, that is, 35B total parameters with only ~3B activated per token) scores 62.2% on the <a href="https://aider.chat/2024/12/21/polyglot.html">Aider Polyglot benchmark</a>, sitting right between Claude Sonnet 4 thinking (61.3%) and Claude 3.7 Sonnet thinking (64.9%), and it does that at 45 tok/s with a 65K-token context. That is Sonnet-class capability, at usable speeds, on a ~$2.5K box. It is not Opus-class (72.0%) or GPT-5-class (88.0%), but Sonnet-class coding on hardware I own, at no per-token cost and with nothing leaving my network.</p>

<p>The surprises were both about memory, not compute. First: even when a model fits comfortably in the A30's VRAM, the A30 can still lose at long context. The A30 wins at short context, but on qwen3-30b-a3b at 65K depth the miniPC outruns it by 36% (28.1 vs 20.6 tok/s; the crossover lands somewhere between 8K and 32K). The A30's 24 GiB runs out of room for the KV cache (the attention key/value store that grows with every token of context), while the miniPC's 128 GiB of unified memory shrugs it off. qwen3-30b-a3b is only 17 GiB, so this isn't even an offload problem: it's all running on the GPU, and the A30 just can't hold a big model and a big KV cache at once.</p>
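<p>To get a feel for how quickly the KV cache eats into 24 GiB, here is a rough size estimate. The architecture numbers (48 layers, 4 KV heads, head dimension 128) and the f16 cache are assumptions I'm plugging in for illustration, not values taken from my runs; check your model's metadata and cache type before trusting the output.</p>

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Size of the key and value caches for one sequence, in GiB."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return elems * bytes_per_elem / 2**30


# Assumed architecture values for illustration; check your model's metadata.
cache = kv_cache_gib(n_layers=48, n_kv_heads=4, head_dim=128,
                     n_ctx=65536, bytes_per_elem=2)
print(f"KV cache at 65K context: {cache:.1f} GiB")
print(f"with 17 GiB of weights: {17 + cache:.1f} GiB of 24 GiB")
```

<p>With those assumed values the cache alone is about 6 GiB at 65K, which puts weights plus cache at roughly 23 GiB before any compute buffers are counted.</p>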

<p>The bigger surprise was the hybrid cliff. For any model that doesn't fit in 24 GiB at all, the miniPC wins outright. qwen3-coder-next (80B-A3B) has to be split across GPU and CPU on the A30, and once half its layers (plus their KV) live in slow DDR4 system memory and get streamed over PCIe every token, it crawls: 3.9 tok/s at 65K against Strix Halo's 38, almost 10x. That gap, a unified-memory box versus a GPU forced into hybrid offload, is the whole reason I bought the thing.</p>

<p>The recommendation has also moved since the last post. Back then, <code>qwen3-coder-next</code> was the best local-fitting coding model I had, and I called its quality "something like Sonnet 4.5 on a slow day." That was ... close enough that I won't say I was wrong, hah (it lands about 9pp below baseline Sonnet 4), but qwen3.6 has since released, and on the same Strix Halo hardware it is straightforwardly better: higher quality, faster, and a smaller VRAM footprint. If you have been running qwen3-coder-next since my last post, switch to qwen3.6.</p>

<div class="toc" style="background-color: var(--color-bg-raised); border: 1px solid var(--color-border); border-left: 3px solid var(--color-accent); border-radius: 10px; padding: 1.5rem 2rem; margin-bottom: 2.5rem;">
<p style="font-family: var(--font-mono); font-size: 0.8125rem; font-weight: 600; color: var(--color-accent); text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 0.75rem;">Contents</p>
<ul style="list-style: none; padding: 0; margin: 0;">
<li style="margin-bottom: 0.35rem; "><a href="#the-contestants" style="font-size: 0.9375rem; text-decoration: none;">The contestants</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#the-benchmarks" style="font-size: 0.9375rem; text-decoration: none;">The benchmarks</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#quality-strix-halo-vs-a30" style="font-size: 0.9375rem; text-decoration: none;">Quality: Strix Halo vs A30</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#quality-local-vs-the-frontier" style="font-size: 0.9375rem; text-decoration: none;">Quality: Local vs the Frontier</a></li>
<li style="margin-bottom: 0.35rem; padding-left: 1.25rem;"><a href="#coding-specific-aider-polyglot" style="font-size: 0.875rem; text-decoration: none;">Coding-specific: Aider Polyglot</a></li>
<li style="margin-bottom: 0.35rem; padding-left: 1.25rem;"><a href="#saturated-benchmarks-humaneval-and-lmeval" style="font-size: 0.875rem; text-decoration: none;">Saturated benchmarks: HumanEval+ and lm_eval</a></li>
<li style="margin-bottom: 0.35rem; padding-left: 1.25rem;"><a href="#quality-at-depth-niah" style="font-size: 0.875rem; text-decoration: none;">Quality at depth: NIAH</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#speed-default-throughput" style="font-size: 0.9375rem; text-decoration: none;">Speed: default throughput</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#speed-at-long-context" style="font-size: 0.9375rem; text-decoration: none;">Speed at long context</a></li>
<li style="margin-bottom: 0.35rem; padding-left: 1.25rem;"><a href="#gpt-oss-20b-fits-both" style="font-size: 0.875rem; text-decoration: none;">gpt-oss-20b (fits both)</a></li>
<li style="margin-bottom: 0.35rem; padding-left: 1.25rem;"><a href="#qwen3-30b-a3b-2507-fits-both" style="font-size: 0.875rem; text-decoration: none;">qwen3-30b-a3b-2507 (fits both)</a></li>
<li style="margin-bottom: 0.35rem; padding-left: 1.25rem;"><a href="#qwen36-thinking-strix-halo-only-sonnet-tier-model" style="font-size: 0.875rem; text-decoration: none;">qwen3.6-thinking (Strix Halo only, Sonnet-tier model)</a></li>
<li style="margin-bottom: 0.35rem; padding-left: 1.25rem;"><a href="#qwen3-coder-next-80b-a3b-strix-halo-uma-vs-a30-hybrid-offload" style="font-size: 0.875rem; text-decoration: none;">qwen3-coder-next 80B-A3B (Strix Halo UMA vs A30 hybrid offload)</a></li>
<li style="margin-bottom: 0.35rem; padding-left: 1.25rem;"><a href="#deepseek-coder-v2-lite-the-bonus-weird-result" style="font-size: 0.875rem; text-decoration: none;">DeepSeek-Coder-V2-Lite, the bonus weird result</a></li>
<li style="margin-bottom: 0.35rem; padding-left: 1.25rem;"><a href="#bonus-models-strix-halo-coverage-only" style="font-size: 0.875rem; text-decoration: none;">Bonus models (Strix Halo coverage only)</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#what-each-box-is-actually-best-for" style="font-size: 0.9375rem; text-decoration: none;">What each box is actually best for</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#what-id-do-differently" style="font-size: 0.9375rem; text-decoration: none;">What I'd do differently</a></li>
<li style="margin-bottom: 0.35rem; "><a href="#the-end-result" style="font-size: 0.9375rem; text-decoration: none;">The end result</a></li>
</ul>
</div>



<h2 id="the-contestants">The contestants</h2>

<table>
<thead>
<tr>
<th>Spec</th>
<th>Strix Halo (the miniPC)</th>
<th>A30</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPU/APU</td>
<td>AMD Ryzen AI MAX+ 395 (gfx1151)</td>
<td>NVIDIA A30 PCIe</td>
</tr>
<tr>
<td>Architecture</td>
<td>RDNA3.5-class iGPU, unified memory (UMA)</td>
<td>Ampere GA100, dedicated PCIe</td>
</tr>
<tr>
<td>Memory</td>
<td>128 GiB LPDDR5 (shared CPU+GPU)</td>
<td>24 GiB HBM2</td>
</tr>
<tr>
<td>Backend</td>
<td>llama.cpp Vulkan (RADV)</td>
<td>llama.cpp CUDA</td>
</tr>
<tr>
<td>TDP (effective)</td>
<td>100W (after <code>ryzenadj</code>)</td>
<td>165W</td>
</tr>
<tr>
<td>Host system</td>
<td>GMKtec NucBox EVO-X2</td>
<td>R740 / Proxmox VM (16c/64G)</td>
</tr>
<tr>
<td>Approx all-in cost</td>
<td>~$2,500</td>
<td>~$2-3K used (card only; bring your own server)</td>
</tr>
<tr>
<td>Largest model that fits</td>
<td>gpt-oss-120B (~64 GiB), qwen3-coder-next 80B-A3B (~45 GiB)</td>
<td>qwen2.5-coder-32B (Q4_K_M) at most</td>
</tr>
</tbody>
</table>

<p>The "what fits" row is the most important and most under-discussed difference, with one clarification: it means <em>fits entirely in VRAM, at full speed</em>. You can push a bigger model onto the A30 with hybrid GPU/CPU offload (I do exactly that later in the post); it just runs much slower once part of it spills out of the 24 GiB. The miniPC's unified memory has no such cliff: anything up to 128 GiB loads and runs on the iGPU at full speed, including a 120B model the A30 can't hold in VRAM at all.</p>

<p>It's worth noting what's doing the spilling on the A30 side: its host is an older Dell R740 on DDR4, so when a model overflows into system RAM, DDR4 bandwidth is the bottleneck. A newer DDR5 host with more memory channels would lift the hybrid numbers, but it would also cost more, which is sort of the point: the unified-memory miniPC sidesteps the spill entirely, for the price of a single mid-range box.</p>

<h2 id="the-benchmarks">The benchmarks</h2>

<p>Four things, each measuring something different:</p>

<ul>
<li><strong>Aider Polyglot</strong>, 225 multi-language <a href="https://exercism.org">Exercism</a> coding exercises in which the model is asked to edit existing files to make tests pass. This is the only benchmark on the list that resembles real-world agentic coding work, and it's the one frontier models actually struggle with. Not saturated.</li>
<li><strong><a href="https://github.com/evalplus/evalplus">HumanEval+</a></strong>, function-level code generation, 164 problems. Top models all score 90%+. Saturated.</li>
<li><strong><a href="https://github.com/EleutherAI/lm-evaluation-harness">lm_eval</a></strong> (<a href="https://github.com/openai/grade-school-math">gsm8k</a>, <a href="https://github.com/google-research/google-research/tree/master/instruction_following_eval">ifeval</a>, <a href="https://github.com/TIGER-AI-Lab/MMLU-Pro">mmlu_pro</a>), knowledge and instruction-following at single-prompt level. Frontier models saturate this too.</li>
<li><strong><a href="https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench">llama-bench</a></strong>, pure throughput, no quality signal. Two numbers matter, both in tokens/sec: <strong>pp</strong> (prompt processing, how fast the model ingests your prompt) and <strong>tg</strong> (token generation, how fast it writes the reply). I report the defaults <code>pp512 / tg128</code> (a 512-token prompt, 128 generated tokens) plus depth tests for long-context behavior.</li>
</ul>
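<p>If pp and tg feel abstract, this tiny sketch shows how the two combine into a rough response time. The rates in the example call are made up purely for illustration and are not measurements from this post. The model also ignores startup overhead and the way tg drops as context grows.</p>

```python
def response_seconds(prompt_tokens, gen_tokens, pp, tg):
    """Rough wall time: ingest the prompt at pp tok/s, then generate at tg tok/s."""
    return prompt_tokens / pp + gen_tokens / tg


# Made-up rates purely for illustration, not measurements from this post.
print(f"{response_seconds(512, 128, pp=1000.0, tg=50.0):.2f} s")
```

<p>For short prompts tg dominates the total. For long-context work the pp term grows with the prompt, which is why the depth tests matter.</p>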

<p>I treat Polyglot as the load-bearing quality metric because (a) it actually discriminates, and (b) it's what I care about: agentic coding is what these boxes get used for in practice. If there's a benchmark I didn't run, it is because I don't know about it or didn't think of it.</p>

<p>One reading convention for every table below: higher is better, unless I explicitly say otherwise.</p>

<h2 id="quality-strix-halo-vs-a30">Quality: Strix Halo vs A30</h2>

<p>Nobody expects a model to get smarter or dumber depending on whether it runs on AMD or NVIDIA silicon, and it doesn't. This was never really an open question. But I had both boxes and was running the benchmarks anyway, so I figured I'd confirm it and see where any real differences showed up. The answer: quality is the same within noise, with one model-specific surprise. Boring but necessary setup for everything that follows.</p>

<table>
<thead>
<tr>
<th>Model</th>
<th>Strix Halo (Vulkan)</th>
<th>A30 (CUDA)</th>
</tr>
</thead>
<tbody>
<tr>
<td>gemma-3-27b-it (HumanEval+ p@1+)</td>
<td>78.7%</td>
<td>77.4%</td>
</tr>
<tr>
<td>qwen2.5-coder-32b (HumanEval+ p@1+)</td>
<td>85.4%</td>
<td>86.6%</td>
</tr>
<tr>
<td>qwen3-30b-a3b-2507 (HumanEval+ p@1+)</td>
<td>89.0%</td>
<td>89.6%</td>
</tr>
<tr>
<td>qwen2.5-coder-32b (Polyglot)</td>
<td>25/225 (11.1%)</td>
<td>25/225 (11.1%)</td>
</tr>
<tr>
<td>qwen3.6 (Polyglot, no think)</td>
<td>121/225 (53.8%)</td>
<td>106/225 (47.1%)</td>
</tr>
</tbody>
</table>

<p>("p@1+" is pass@1 on EvalPlus's extended test set, meaning the model's first answer has to pass every test.)</p>

<p>Within noise on HumanEval+ across the board, and on the qwen2.5-coder Polyglot row. The qwen3.6 Polyglot row shows a 6.7pp cross-host gap (53.8% Strix Halo vs 47.1% A30), which is larger than I'd expect from pure sampling noise; possibly a real CUDA-vs-Vulkan difference for that specific model and harness, or a build-version skew between the two boxes. The HumanEval+ gemma/qwen2.5/qwen3-30b rows on the same model files agree to within about 1.3pp cross-host, and the qwen2.5-coder Polyglot row agrees exactly, so it isn't a general "the A30 produces worse logits" pattern; it's a qwen3.6-Polyglot-specific finding I'd want to dig into in a future bench.</p>
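<p>One way to put a rough number on "noise" for a 225-task benchmark is a two-proportion z-score. This is a sketch under a simplifying assumption: it treats the two runs as independent samples, when in fact both hosts ran the same 225 exercises, so a paired comparison on per-exercise results would be the sharper test. The function name is made up for illustration.</p>

```python
from math import sqrt


def two_proportion_z(pass_a, pass_b, n):
    """z-score for the difference between two pass rates over n tasks each.

    Treats the two runs as independent samples, which is a simplification.
    """
    p_a, p_b = pass_a / n, pass_b / n
    std_err = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return (p_a - p_b) / std_err


print(f"qwen3.6 Polyglot, Strix Halo vs A30: z = {two_proportion_z(121, 106, 225):.2f}")
```

<p>That comes out at about 1.4 standard errors: enough to be worth a second look, not enough on its own to call it a confirmed backend difference.</p>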

<p>So the model is mostly the model, and hardware doesn't make it dumber in any general sense. There can be model-specific cross-host quirks worth checking (this one came as a surprise to me), but for the typical case, once you've picked a model that fits, the hardware question reduces to <em>how fast</em> and <em>can it even fit</em>.</p>

<h2 id="quality-local-vs-the-frontier">Quality: Local vs the Frontier</h2>

<p>Here's where it gets fun. I'm going to split this into coding and non-coding because they behave very differently.</p>

<h3 id="coding-specific-aider-polyglot">Coding-specific: Aider Polyglot</h3>

<p>Polyglot is the benchmark where frontier models still have headroom, and the one that tracks "how good is this thing as a coding agent." Here's the comparison (Aider leaderboard scores for the API models, my results for local):</p>

<table>
<thead>
<tr>
<th>Model</th>
<th>Polyglot pass rate</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>GPT-5 (high)</strong></td>
<td><strong>88.0%</strong></td>
<td>API</td>
</tr>
<tr>
<td><strong>Gemini-2.5-Pro (32k think)</strong></td>
<td><strong>83.1%</strong></td>
<td>API</td>
</tr>
<tr>
<td>DeepSeek-V3.2 Reasoner</td>
<td>74.2%</td>
<td>API (open weight ~700B, won't fit my hardware)</td>
</tr>
<tr>
<td><strong>Claude Opus 4 (32k think)</strong></td>
<td><strong>72.0%</strong></td>
<td>API</td>
</tr>
<tr>
<td>Claude Opus 4 (no think)</td>
<td>70.7%</td>
<td>API</td>
</tr>
<tr>
<td>Claude 3.7 Sonnet (32k think)</td>
<td>64.9%</td>
<td>API</td>
</tr>
<tr>
<td><strong>qwen3.6-thinking (Strix Halo)</strong></td>
<td><strong>62.2%</strong></td>
<td><strong>local, 35B-A3B MoE</strong></td>
</tr>
<tr>
<td>Claude Sonnet 4 (32k think)</td>
<td>61.3%</td>
<td>API</td>
</tr>
<tr>
<td>Claude 3.7 Sonnet (no think)</td>
<td>60.4%</td>
<td>API</td>
</tr>
<tr>
<td>Claude Sonnet 4 (no think)</td>
<td>56.4%</td>
<td>API</td>
</tr>
<tr>
<td>qwen3.6 (Strix Halo, no think)</td>
<td>53.8%</td>
<td>local</td>
</tr>
<tr>
<td>qwen3-coder-next (Strix Halo)</td>
<td>47.6%</td>
<td>local, 80B-A3B MoE; doesn't fit on A30, see the Speed sections</td>
</tr>
<tr>
<td>qwen3.6 (A30, no think)</td>
<td>47.1%</td>
<td>local</td>
</tr>
<tr>
<td>GPT-OSS-120B (high)</td>
<td>41.8%</td>
<td>leaderboard score, API</td>
</tr>
<tr>
<td>qwen3-30b-a3b-2507 (Strix Halo)</td>
<td>30.2%</td>
<td>local</td>
</tr>
<tr>
<td>qwen3-30b-a3b-2507 (A30)</td>
<td>28.9%</td>
<td>local</td>
</tr>
<tr>
<td>GPT-OSS-20B-thinking (A30)</td>
<td>16.9%</td>
<td>local</td>
</tr>
<tr>
<td>GPT-OSS-120B (Strix Halo, Q4_K_M)</td>
<td>1.8%*</td>
<td>local, almost certainly broken locally, see footnote</td>
</tr>
</tbody>
</table>

<p><em>* The 23× gap between local gpt-oss-120B (1.8%) and the same model's API leaderboard score (41.8%) is almost certainly the <code>reasoning_effort</code> parameter not wiring through to llama.cpp's gpt-oss path: low/medium/high produce near-identical outputs within sampling noise. For a model whose top-line capability <em>is</em> its reasoning depth, a broken reasoning knob is a broken model. Full discussion in item 4 below.</em></p>

<p>(All API model scores in this table come from the <a href="https://aider.chat/docs/leaderboards/">Aider Polyglot leaderboard</a>, last updated 2025-11-20. A few newer frontier releases (Google's Gemini 3 and Anthropic's Claude Opus 4.5 / 4.7) exist but haven't been scored by the Aider team yet, so they aren't represented above. The most recent Gemini and Opus variants the leaderboard does have are Gemini 2.5 Pro 32k-think at 83.1% and Claude Opus 4 32k-think at 72.0%.)</p>

<p><strong>What this shows:</strong></p>

<ol>
<li>
<p><strong>Sonnet-class is achievable locally, in both thinking and no-think modes.</strong> My best local model (<code>qwen3.6-thinking</code>, a 35B-A3B MoE) sits right in the Claude Sonnet thinking band (62.2% vs Sonnet 4 thinking 61.3%). And on the apples-to-apples no-think comparison, qwen3.6 with thinking off (53.8%) is just 2.6pp under Claude Sonnet 4 no-think (56.4%); effectively tied within Polyglot's noise floor. So it's not just "Sonnet-class when allowed to think"; it's "Sonnet-class without needing to think." That second result was the bigger surprise.</p>
</li>
<li>
<p><strong>The recommendation has moved since my last post.</strong> Back then, <code>qwen3-coder-next</code> (80B-A3B) was the best local-fitting coding model I had, and the explicit subject of <a href="https://damenknight.com/running-frontier-coding-model-mini-pc/">the previous post</a>. <code>qwen3.6</code> didn't exist yet. Now it does, and it's straightforwardly better: 53.8% Polyglot at thinking-off (vs qwen3-coder-next's 47.6%), 62.2% at thinking-on, smaller VRAM footprint, faster throughput. If you've been running qwen3-coder-next on Strix Halo since my last post: try qwen3.6.</p>
</li>
<li>
<p><strong>The real gap is to GPT-5, Gemini 2.5 Pro, and Claude Opus.</strong> Those three are ~10-26pp ahead of my best local model. The Anthropic ladder is worth calling out specifically: qwen3.6-thinking (62.2%) is essentially tied with Sonnet 4 thinking (61.3%), but Anthropic's actual flagship is Opus, which scores 72.0%, about 10pp ahead of local. Then GPT-5 (88.0%) and Gemini 2.5 Pro thinking (83.1%) are the real top of the leaderboard. DeepSeek V3.2 Reasoner (74.2%) is the closest open-weight to that band, but at ~700B parameters it won't fit on either of my boxes.</p>
</li>
<li>
<p><strong>Local quants underperform their API counterparts catastrophically on some models.</strong> My local <code>gpt-oss-120B</code> Q4_K_M scored 1.8%; the leaderboard's <code>gpt-oss-120b (high)</code> scored 41.8%. That's a 23x gap, not a small one. Three things contribute: quantization; the <code>reasoning_effort</code> parameter, which <a href="#footnote-effort">doesn't actually wire through to the model on llama.cpp</a> (I verified this: low/medium/high produce near-identical outputs within sampling noise); and my use of Aider's <code>whole</code> edit format where the leaderboard uses <code>diff</code>. The reasoning-effort issue is probably the biggest factor: gpt-oss is essentially a reasoning model, and if the reasoning depth knob is broken, the model is operating in something close to a "low effort" mode regardless of what you pass in.</p>
</li>
<li>
<p><strong>Thinking mode is meaningful when measured correctly.</strong> <code>qwen3.6</code> without thinking: 53.8%. With thinking: 62.2%. That's 8.4 pp of capability sitting behind a flag.</p>
</li>
</ol>

<h4>A caveat about polyglot versioning</h4>

<p>My test harness ran 225 exercises on most models, the same set as Aider's leaderboard. A few runs got 289 or 450 (multiple attempts per exercise from a config tweak); rates are still computed as <code>passed/total</code>. Edit-format matters too: I used <code>whole</code> because it's more robust to weaker models, while Aider's leaderboard uses <code>diff</code> because it gets better scores from the top models. <code>whole</code> is generally a slight handicap. Treat the comparisons as directional, not exact.</p>

<p>Methodology note: all HumanEval+ numbers in this post come from the EvalPlus toolchain (generation via <code>evalplus.codegen</code>), the canonical harness behind EvalPlus's published leaderboard.</p>

<h3 id="saturated-benchmarks-humaneval-and-lmeval">Saturated benchmarks: HumanEval+ and lm_eval</h3>

<p>These are the benches where frontier and local models all score in the same 85-95% range, so they don't discriminate well anymore. Quick look:</p>

<p>Cross-host gsm8k + ifeval, identical Q4_K_M quantization, identical chat-completions API, 200 items each:</p>

<table>
<thead>
<tr>
<th>Model</th>
<th>Strix Halo gsm8k</th>
<th>A30 gsm8k</th>
<th>Strix Halo ifeval</th>
<th>A30 ifeval</th>
</tr>
</thead>
<tbody>
<tr>
<td>gemma-3-4b-it</td>
<td>78.5%</td>
<td>82.5%</td>
<td>69.5%</td>
<td>68.0%</td>
</tr>
<tr>
<td>qwen3-4b-2507</td>
<td>90.0%</td>
<td>90.0%</td>
<td>79.5%</td>
<td>80.5%</td>
</tr>
<tr>
<td>gemma-3-12b-it</td>
<td>92.5%</td>
<td>91.0%</td>
<td>72.0%</td>
<td>72.5%</td>
</tr>
<tr>
<td>gemma-3-27b-it</td>
<td>93.5%</td>
<td>93.5%</td>
<td>77.5%</td>
<td>76.0%</td>
</tr>
<tr>
<td>qwen3-30b-a3b-2507</td>
<td>95.5%</td>
<td>93.5%</td>
<td>79.5%</td>
<td>80.5%</td>
</tr>
<tr>
<td>qwen2.5-coder-32b</td>
<td>95.0%</td>
<td>95.0%</td>
<td>75.0%</td>
<td>75.0%</td>
</tr>
<tr>
<td>phi-4 (14B)</td>
<td>90.5%</td>
<td>91.0%</td>
<td>56.5%</td>
<td>57.0%</td>
</tr>
<tr>
<td>mistral-small-3.2-24b</td>
<td>95.0%</td>
<td>94.5%</td>
<td>76.5%</td>
<td>72.0%</td>
</tr>
<tr>
<td>qwen3.6-thinking</td>
<td><strong>96.5%</strong></td>
<td>n/a (model not on A30)</td>
<td>79.0%</td>
<td>n/a</td>
</tr>
<tr>
<td>gpt-oss-20b (reasoning off)</td>
<td>87.5%</td>
<td>n/a (different run config)</td>
<td>25.5%</td>
<td>n/a</td>
</tr>
</tbody>
</table>

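<p>Most cross-host pairs agree within 1–2 percentage points. The two widest gaps are gemma-3-4b-it on gsm8k (78.5% vs 82.5%) and mistral-small-3.2-24b on ifeval (76.5% vs 72.0%), which is still plausible sampling noise on 200 items. <em>Same model, same quantization, same prompt, same score within noise</em>.</p>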
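
<p>For a rough sense of how much noise a 200-item subset carries, here is a minimal sketch of the binomial standard error of a pass rate. It assumes every item is an independent pass/fail trial, which is a simplification, and <code>se_pp</code> is a hypothetical helper name, not part of lm_eval.</p>

<pre><code class="language-bash"># Binomial standard error of a pass rate, in percentage points.
# Assumption: items are independent pass/fail trials.
# usage: se_pp RATE_PERCENT N_ITEMS
se_pp() {
  awk -v p="$1" -v n="$2" 'BEGIN { q = p / 100; printf "%.1f\n", 100 * sqrt(q * (1 - q) / n) }'
}

se_pp 90 200     # a model scoring around 90% on 200 items
se_pp 75 200     # a model scoring around 75% on 200 items
</code></pre>

<p>Under that assumption a single score carries roughly 2 to 3 pp of standard error, so a gap of a few points between two runs of the same model is not surprising.</p>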

<p>Notable observations:</p>

<ul>
<li><strong>qwen3.6-thinking tops gsm8k at 96.5%</strong>, better than any A30 result in this set, and in the same ~95–97% band frontier models hit on saturated math benches before the frontier moved on to <a href="https://maa.org/maa-invitational-competitions/">AIME</a> / <a href="https://epoch.ai/frontiermath">FrontierMath</a>. On a 35B-A3B MoE running on a miniPC.</li>
<li><strong>gpt-oss-20b ifeval at 25.5%</strong> is shockingly low for a model that hits 87.5% on gsm8k in the same run. This is the <code>--reasoning off</code> configuration. The other gpt-oss-20b runs in my data, <code>reasoning on</code> variants, also fall in the 25–36% ifeval band, so this isn't a reasoning-flag artifact; gpt-oss-20b just struggles with strict prompt-following regardless. Worth knowing if you were planning to deploy it for instruction-bound tasks.</li>
<li><strong>phi-4 inverted profile</strong>: 90.5% gsm8k but only 56.5% ifeval. It's a math-strong, instruction-weaker model. Useful data point for choosing models by use case.</li>
</ul>

<p>Same story as the coding benches: <em>local models are basically tied with frontier on saturated benchmarks</em>. A Qwen3-30B-A3B on a miniPC scores 95.5% on gsm8k, comfortably in the same band as any frontier model that's still being measured against gsm8k. The frontier moat still exists, but it's on real-world agentic coding (Polyglot).</p>

<h3 id="quality-at-depth-niah">Quality at depth: NIAH</h3>

<p>Throughput at 65K context is meaningless if the model can't actually <em>find</em> anything at 65K. I tested <a href="https://github.com/gkamradt/LLMTest_NeedleInAHaystack">needle-in-haystack</a> retrieval (single-needle: ask "what's the best thing to do in San Francisco?" with a sandwich-and-Dolores-Park needle planted at 10%, 50%, or 90% depth in a haystack of Paul Graham essays):</p>

<table>
<thead>
<tr>
<th>Model</th>
<th>Host</th>
<th>Pass rate (4K/16K/32K/60K × 3 depths)</th>
</tr>
</thead>
<tbody>
<tr>
<td>qwen3.6-thinking</td>
<td>Strix Halo</td>
<td><strong>100%</strong> (12/12)</td>
</tr>
<tr>
<td>qwen3-coder-next</td>
<td>Strix Halo</td>
<td><strong>100%</strong> (12/12)</td>
</tr>
<tr>
<td>qwen3-30b-a3b-2507</td>
<td>A30</td>
<td>100% (12/12)</td>
</tr>
<tr>
<td>qwen2.5-coder-32b</td>
<td>A30</td>
<td>100% (12/12)</td>
</tr>
<tr>
<td>gemma-3-27b-it</td>
<td>A30</td>
<td>100% (12/12)</td>
</tr>
<tr>
<td>phi-4</td>
<td>A30</td>
<td>100% (12/12)</td>
</tr>
<tr>
<td>mistral-small-3.2-24b</td>
<td>A30</td>
<td>100% (12/12)</td>
</tr>
<tr>
<td>llama-4-scout-17b-16e</td>
<td>A30</td>
<td>100% (12/12)</td>
</tr>
<tr>
<td>qwen3-4b-2507</td>
<td>A30</td>
<td>100% (12/12)</td>
</tr>
<tr>
<td>gpt-oss-20b</td>
<td>A30</td>
<td>91.7% (11/12)</td>
</tr>
<tr>
<td>granite-3.1-8b-instruct</td>
<td>A30</td>
<td>88.9% (failed at depth)</td>
</tr>
<tr>
<td>deepseek-coder-v2-lite</td>
<td>A30</td>
<td><strong>33.3%</strong> (4/12: all three 4K cells passed, 1 of 3 passed at 16K, and every 32K and 60K cell timed out at 600s). Same root cause as the llama-bench MLA cliff: CUDA-on-MLA is too slow at depth to finish a single query inside any reasonable budget</td>
</tr>
</tbody>
</table>

<p><em>Top-row finding</em>: Strix Halo's qwen3.6-thinking and qwen3-coder-next both score perfect retrieval at 60K context, with response times of 1-2 min per query. The model isn't just <em>running</em> with that context, it's actually <em>using</em> it. Combined with the throughput numbers, this is what makes the miniPC a real coding-agent target rather than a benchmark curiosity.</p>

<h2 id="speed-default-throughput">Speed: default throughput</h2>

<p>Quality matters; speed matters more than people think. A 62% model running at 1 tok/s is unusable. A 50% model at 80 tok/s is a daily driver.</p>

<p>(Methodology note before the tables: every Strix Halo throughput number below was collected with no other model servers running, fans pinned to max, and free memory verified before each run. There's a bench wrapper now that refuses to start without those conditions met. I ended up writing it after melting the poor machine twice; details are in <a href="#what-id-do-differently">What I'd do differently</a> below.)</p>

<p>Default <code>pp512 / tg128</code> numbers (Q4_K_M, <code>-fa 1</code>, Strix Halo on q8_0 KV / A30 on q4_0 KV, see longctx section for the protocol note). Throughput is in tokens/sec, so higher is better. The last column is the one place bigger isn't better: it's the A30/miniPC tg ratio, where above 1.0 means the A30 is faster and below 1.0 means the miniPC wins (I flag those rows inline).</p>

<table>
<thead>
<tr>
<th>Model</th>
<th>Strix Halo pp</th>
<th>A30 pp</th>
<th>Strix Halo tg</th>
<th>A30 tg</th>
<th>A30/Strix Halo tg ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>qwen3-4b-2507</td>
<td>2048</td>
<td>3934</td>
<td>75.0</td>
<td>118.9</td>
<td>1.59x</td>
</tr>
<tr>
<td>gemma-3-4b-it</td>
<td>2257</td>
<td>4431</td>
<td>74.1</td>
<td>109.5</td>
<td>1.48x</td>
</tr>
<tr>
<td>gemma-3-12b-it</td>
<td>750</td>
<td>1590</td>
<td>26.9</td>
<td>52.4</td>
<td>1.95x</td>
</tr>
<tr>
<td>phi-4 (14B)</td>
<td>652</td>
<td>1452</td>
<td>24.1</td>
<td>53.2</td>
<td>2.21x</td>
</tr>
<tr>
<td>gpt-oss-20b</td>
<td>1287</td>
<td>2805</td>
<td>80.8</td>
<td>130.2</td>
<td>1.61x</td>
</tr>
<tr>
<td>mistral-small-3.2-24b</td>
<td>267</td>
<td>905</td>
<td>15.3</td>
<td>34.5</td>
<td>2.26x</td>
</tr>
<tr>
<td>gemma-3-27b-it</td>
<td>230</td>
<td>771</td>
<td>12.6</td>
<td>28.0</td>
<td>2.23x</td>
</tr>
<tr>
<td>qwen3-30b-a3b-2507 (MoE)</td>
<td>1167</td>
<td>2274</td>
<td>87.0</td>
<td>136.2</td>
<td>1.56x</td>
</tr>
<tr>
<td>qwen2.5-coder-32b</td>
<td>186</td>
<td>633</td>
<td>11.1</td>
<td>24.1</td>
<td>2.18x</td>
</tr>
<tr>
<td>qwen3-coder-next (80B-A3B)†</td>
<td>551</td>
<td>110</td>
<td>56.4</td>
<td>12.2</td>
<td><strong>0.22x</strong> ← Strix Halo wins 4.6x</td>
</tr>
<tr>
<td>qwen3.6 (35B-A3B MoE)</td>
<td>944</td>
<td>1933</td>
<td>67.1</td>
<td>99.9</td>
<td>1.49x</td>
</tr>
</tbody>
</table>

<p>† The A30 row for qwen3-coder-next is <strong>hybrid GPU/CPU offload</strong> (22 of 49 layers on GPU, the rest on CPU/RAM). The 45 GiB Q4_K_M model can't fit fully in 24 GiB VRAM, so this is what you get if you force it onto the A30 anyway: the apples-to-apples speed cost of exceeding the VRAM ceiling on a dedicated GPU.</p>

<p>Two stories here:</p>

<p><strong>1. A30 wins at default by 2-3x.</strong> Expected: a dedicated GPU with proper VRAM and CUDA kernels should beat an APU running Vulkan. On tg the factor is consistent across dense models from 12B up, in the 1.95-2.26x range; on pp the same models run 2.1-3.4x.</p>

<p><strong>2. MoE narrows the gap and makes the miniPC viable.</strong> Look at qwen3-30b-a3b-2507: the A30/Strix Halo tg ratio is just 1.56x, and qwen3.6 (35B-A3B) is lower still at 1.49x, the smallest gaps in the table among the bigger models. That's because these models only activate ~3B params per token. Memory bandwidth matters more than raw compute for tg, and Strix Halo's UMA gives it surprisingly good bandwidth for active-parameter-light workloads. (The 4B models also show ratios below 2x; small models stop benefiting from the A30's compute headroom because they're already bandwidth-bound on both boxes.)</p>

<p>Compare that to the dense qwen2.5-coder-32b: 11.1 tok/s on Strix Halo vs 24.1 on A30, still a 2.18x gap but the absolute number is terrible on Strix Halo. I don't know about the rest of you, but 11 tok/s on a 32B dense model is not exactly what I'd call "usable". I'd never reach for the dense coder if a comparable-quality MoE exists.</p>

<h2 id="speed-at-long-context">Speed at long context</h2>

<p>Now the fun part.  Wait, I already said that.  Another fun part! Coding agents send long context (the codebase, the test results, previous turns), so what happens when you push the depth?</p>

<p>I ran the same <code>pp512 / tg128</code> test at depths 0 / 8K / 32K / 65K. Strix Halo is benched with q8_0 KV cache (matches how the production llama-servers are deployed). A30's previously-collected longctx sweep was at q4_0 KV. The small protocol asymmetry is mildly conservative for Strix Halo at depth: q4_0 saves a bit of KV bandwidth at the cost of dequant overhead, which is within run-to-run noise on this hardware, so if anything running q8_0 shaves a few percent off the Strix Halo side at deep contexts.</p>
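
<p>As a rough sketch of that sweep, a <code>llama-bench</code> invocation along these lines covers all four depths in one run. The model path is a placeholder, and mapping 8K / 32K / 65K to 8192 / 32768 / 65536 tokens is an assumption here rather than something the sweep logs confirm. The helper only prints the command so it can be reviewed first.</p>

<pre><code class="language-bash"># Sketch of the depth sweep. MODEL is a placeholder path; the depth values
# assume 8K / 32K / 65K mean 8192 / 32768 / 65536 tokens.
MODEL="${MODEL:-model.gguf}"
KV_TYPE="${KV_TYPE:-q8_0}"      # q8_0 on Strix Halo, q4_0 for the A30 sweep

# Print the llama-bench command line so it can be reviewed before running.
sweep_cmd() {
  echo llama-bench -m "$MODEL" -p 512 -n 128 \
    -d 0,8192,32768,65536 -fa 1 -ctk "$KV_TYPE" -ctv "$KV_TYPE"
}

sweep_cmd
</code></pre>

<p>Setting <code>KV_TYPE=q4_0</code> gives the A30-style protocol; dropping the <code>echo</code> runs the bench instead of printing it.</p>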

<h3 id="gpt-oss-20b-fits-both">gpt-oss-20b (fits both)</h3>

<table>
<thead>
<tr>
<th>Depth</th>
<th>Strix Halo pp</th>
<th>A30 pp</th>
<th>Strix Halo tg</th>
<th>A30 tg</th>
<th>A30/Strix Halo tg</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 (default)</td>
<td>1287</td>
<td>2805</td>
<td>80.8</td>
<td>130.2</td>
<td>1.61x</td>
</tr>
<tr>
<td>8K</td>
<td>958</td>
<td>2522</td>
<td>66.6</td>
<td>109.5</td>
<td>1.64x</td>
</tr>
<tr>
<td>32K</td>
<td>547</td>
<td>1933</td>
<td>56.9</td>
<td>77.2</td>
<td>1.36x</td>
</tr>
<tr>
<td>65K</td>
<td>338</td>
<td>1452</td>
<td>45.6</td>
<td>54.5</td>
<td><strong>1.20x</strong></td>
</tr>
</tbody>
</table>

<p>A30 tg dropped 58% from default to 65K depth (130 to 55 tok/s). Strix Halo tg dropped 44% over the same range (81 to 46 tok/s). A30 still wins on this model at every depth, but the lead shrinks dramatically as context grows: the A30/Strix Halo ratio compresses from 1.61x at default to 1.20x at 65K.</p>
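
<p>The drop percentages and ratios quoted for these depth tables are simple to recompute from the raw tg columns. A minimal sketch, with hypothetical helper names and the gpt-oss-20b numbers from the table above as inputs:</p>

<pre><code class="language-bash"># Percent drop going from a to b, rounded to a whole percent.
pct_drop() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a - b) / a * 100 }'
}

# Ratio a/b to two decimals (here: A30 tg over Strix Halo tg).
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

pct_drop 130.2 54.5   # A30 tg, depth 0 to 65K
pct_drop 80.8 45.6    # Strix Halo tg, depth 0 to 65K
ratio 130.2 80.8      # A30/Strix Halo at depth 0
ratio 54.5 45.6       # A30/Strix Halo at 65K
</code></pre>

<p>A ratio above 1.0 means the A30 is faster at that depth; below 1.0 means the miniPC wins.</p>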

<h3 id="qwen3-30b-a3b-2507-fits-both">qwen3-30b-a3b-2507 (fits both)</h3>

<table>
<thead>
<tr>
<th>Depth</th>
<th>Strix Halo pp</th>
<th>A30 pp</th>
<th>Strix Halo tg</th>
<th>A30 tg</th>
<th>A30/Strix Halo tg</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 (default)</td>
<td>1167</td>
<td>2274</td>
<td>87.0</td>
<td>136.2</td>
<td>1.56x</td>
</tr>
<tr>
<td>8K</td>
<td>533</td>
<td>1746</td>
<td>62.1</td>
<td>72.5</td>
<td>1.17x</td>
</tr>
<tr>
<td>32K</td>
<td>205</td>
<td>1012</td>
<td>40.2</td>
<td>35.1</td>
<td><strong>0.87x</strong> ← Strix Halo wins</td>
</tr>
<tr>
<td>65K</td>
<td>110</td>
<td>631</td>
<td>28.1</td>
<td>20.6</td>
<td><strong>0.73x</strong> ← Strix Halo wins by 36%</td>
</tr>
</tbody>
</table>

<p><em>This is where it gets spicy.</em> A30 tg dropped <em>85%</em> from default to 65K (136 to 21 tok/s); the 24 GiB VRAM ran out of room for a meaningful KV cache at depth. Strix Halo tg dropped 68% over the same range (87 to 28 tok/s), painful but consistent. The crossover happens between 8K and 32K context: at 32K the miniPC is already faster, and at 65K it's 36% faster than the dedicated GPU.</p>

<p>The model itself is 17 GiB at Q4_K_M. The A30 has 24 GiB of VRAM. At 65K context the KV cache and activations are competing for the remaining 7 GiB of headroom, and CUDA's memory management gets bottlenecked. Strix Halo's 128 GiB UMA doesn't care: there's so much memory headroom that the only constraints are compute and bandwidth, both of which degrade gracefully.</p>

<h3 id="qwen36-thinking-strix-halo-only-sonnet-tier-model">qwen3.6-thinking (Strix Halo only, Sonnet-tier model)</h3>

<p>This is the model I'd actually use for coding. The numbers are remarkable:</p>

<table>
<thead>
<tr>
<th>Depth</th>
<th>Strix Halo pp</th>
<th>Strix Halo tg</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 (default)</td>
<td>944</td>
<td>67.1</td>
</tr>
<tr>
<td>8K</td>
<td>790</td>
<td>61.8</td>
</tr>
<tr>
<td>32K</td>
<td>517</td>
<td>55.6</td>
</tr>
<tr>
<td>65K</td>
<td>349</td>
<td><strong>45.5</strong></td>
</tr>
</tbody>
</table>

<p>tg drops <em>32%</em> from default to 65K depth (67.1 to 45.5 tok/s). That's a Sonnet-class model running locally at 45 tok/s with a 65K-token context window, which is <em>actually usable</em> for serious agentic coding: you can pack a meaningful chunk of a codebase into the context and not pay a brutal speed tax for it.</p>

<p><strong>A note on Q8_0:</strong> I also ran the no-think qwen3.6 at Q8_0 (38 GiB on disk vs Q4_K_M's 20 GiB). Polyglot moved from 53.8% to 56.9%, a ~3 pp gain. Throughput dropped from 65 tok/s to 50 tok/s at default and falls by a similar proportion at depth. So if you have the disk and want every last point of Polyglot, Q8_0 is a real upgrade. If you'd rather have the speed, Q4_K_M is the right call; the quality gap is small relative to the speed cost.</p>

<h3 id="qwen3-coder-next-80b-a3b-strix-halo-uma-vs-a30-hybrid-offload">qwen3-coder-next 80B-A3B (Strix Halo UMA vs A30 hybrid offload)</h3>

<p>The 80B-A3B that motivated <a href="https://damenknight.com/running-frontier-coding-model-mini-pc/">the last post</a>. At 45 GiB Q4_K_M it doesn't fit in 24 GiB VRAM, so the A30 column here is <strong>hybrid GPU/CPU offload</strong> (<code>-ngl 22</code>, 22 of 49 layers on GPU, the rest streamed from system RAM). Strix Halo's 128 GiB UMA swallows the full model and runs entirely on the iGPU:</p>

<table>
<thead>
<tr>
<th>Depth</th>
<th>Strix Halo pp</th>
<th>A30 hybrid pp</th>
<th>Strix Halo tg</th>
<th>A30 hybrid tg</th>
<th>A30/Strix Halo tg</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 (default)</td>
<td>551</td>
<td>110</td>
<td>56.4</td>
<td>12.2</td>
<td><strong>0.22x</strong> ← Strix Halo wins 4.6x</td>
</tr>
<tr>
<td>8K</td>
<td>500</td>
<td>109</td>
<td>52.5</td>
<td>9.3</td>
<td><strong>0.18x</strong> ← Strix Halo wins 5.6x</td>
</tr>
<tr>
<td>32K</td>
<td>372</td>
<td>109</td>
<td>46.9</td>
<td>5.4</td>
<td><strong>0.12x</strong> ← Strix Halo wins 8.7x</td>
</tr>
<tr>
<td>65K</td>
<td>256</td>
<td>106</td>
<td>38.1</td>
<td>3.9</td>
<td><strong>0.10x</strong> ← Strix Halo wins 9.8x</td>
</tr>
</tbody>
</table>

<p>This is the clearest "wrong tool for the job" result I had. The A30 is a <em>good</em> card, it just doesn't have enough VRAM to hold the model, and PCIe bandwidth between GPU and host RAM is roughly 30x slower than the A30's own HBM2. So every token has to drag activations across that bottleneck.</p>

<p>The math: A30 hybrid tg falls from 12.2 to 3.9 tok/s (a 68% drop) over the depth sweep, while Strix Halo's UMA tg falls from 56.4 to 38.1 (only 32%). The A30 falls off twice as steeply because attention has to read the full KV cache to produce each new token, and on hybrid mode roughly half the model's layers, plus their slice of the KV cache, live in CPU RAM (DDR4, on this server). Each token's attention op pays PCIe-bandwidth overhead, and that overhead scales with context length. So 4.6× at default and <em>9.8× at 65K</em>.</p>

<p>On the Strix system the story is the other way around: the iGPU has the same bandwidth to all 128 GiB as it does to the first 16 GiB. There's no VRAM cliff to fall off because there's no VRAM/RAM distinction at all. tg drops 32% from default to 65K (56.4 to 38.1 tok/s), painful but consistent, and at 38 tok/s with 65K of context loaded it's still... not fast, but usable.</p>

<p>(I also tried to run Aider Polyglot on A30 hybrid for a quality cross-check; the harness's per-call timeout repeatedly fired against the 3.9–9.3 tok/s hybrid response rate, and I abandoned the run after 9 of 225 exercises in ~5 hours. Throughput data above is from <code>llama-bench</code> directly, which doesn't have that problem.)</p>

<h3 id="deepseek-coder-v2-lite-the-bonus-weird-result">DeepSeek-Coder-V2-Lite, the bonus weird result</h3>

<p>I benchmarked this one for completeness, expecting nothing exciting. Instead I found one of the clearest "the dedicated GPU is broken here" results in the whole sweep. DeepSeek-V2's <a href="https://arxiv.org/abs/2405.04434">Multi-head Latent Attention (MLA)</a> uses a low-rank-projected KV cache that's smaller than standard MHA but requires a different attention kernel. The CUDA implementation in llama.cpp build 9064 falls off a cliff once any KV is present:</p>

<table>
<thead>
<tr>
<th>Depth</th>
<th>Strix Halo pp (Vulkan)</th>
<th>A30 pp (CUDA)</th>
<th>Strix Halo tg</th>
<th>A30 tg</th>
<th>A30/Strix Halo tg</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 (default)</td>
<td>1641</td>
<td>408</td>
<td>106.0</td>
<td>88.5</td>
<td><strong>0.83x</strong> ← Strix Halo wins</td>
</tr>
<tr>
<td>8K</td>
<td>1032</td>
<td>17</td>
<td>64.4</td>
<td>4.9</td>
<td><strong>0.08x</strong> ← Strix Halo wins 13x</td>
</tr>
<tr>
<td>32K</td>
<td>484</td>
<td>wedged</td>
<td>30.8</td>
<td>wedged</td>
<td>n/a</td>
</tr>
<tr>
<td>65K</td>
<td>250</td>
<td>wedged</td>
<td>17.4</td>
<td>wedged</td>
<td>n/a</td>
</tr>
</tbody>
</table>

<p>The A30 bench actually wedged my harness: at d=32K the CUDA kernel grinds at ~3-5 tok/s prefill, which means a single measurement of the 32K-token prefill would take 100+ minutes. I killed it after 17 minutes of no progress.</p>

<p>Strix Halo's Vulkan path handles MLA at depth normally. Degrading from 106 to 17 tok/s tg is a real cliff, but it's a <em>finite</em> one and the bench actually finishes. <em>Even at d=0 Strix Halo is 4× faster on pp512 (1641 vs 408), and that's before any KV is in play.</em> The CUDA backend isn't just slow at depth on this architecture; it's slow on this architecture, full stop.</p>

<p>I don't think this is a hardware issue; it's a software bug, presumably to be fixed in some future llama.cpp release lol. But for anyone considering DeepSeek-V2-family models for coding <em>right now</em>, the miniPC is the only sensible target. A 24 GiB A30 will load the model just fine and then be fairly unusable.</p>

<h3 id="bonus-models-strix-halo-coverage-only">Bonus models (Strix Halo coverage only)</h3>

<p>For completeness, two more models I benchmarked on Strix Halo to fill out the table:</p>

<table>
<thead>
<tr>
<th>Model</th>
<th>Default pp</th>
<th>Default tg</th>
<th>65K pp</th>
<th>65K tg</th>
<th>Note</th>
</tr>
</thead>
<tbody>
<tr>
<td>granite-3.1-8b-instruct</td>
<td>996</td>
<td>39.5</td>
<td>(crashed)</td>
<td>(crashed)</td>
<td>Vulkan device-lost at d=65K, got d=0/8K/32K only</td>
</tr>
<tr>
<td>llama-4-scout-17b-16e</td>
<td>159</td>
<td>20.1</td>
<td>105</td>
<td>13.9</td>
<td>17B-active, 109B-total, slowest in the post but flattest depth scaling (only 31% tg drop)</td>
</tr>
</tbody>
</table>

<h2 id="what-each-box-is-actually-best-for">What each box is actually best for</h2>

<ul>
<li><strong>Strix Halo as a coding agent:</strong> qwen3.6 with thinking on when I want quality, qwen3.6 with thinking off when I want speed/quality balance. Same model file, same throughput, just flip the <code>--reasoning</code> flag.</li>
<li><strong>A30 for serving small concurrent requests:</strong> gpt-oss-20b at 130 tok/s or qwen3-30b-a3b at 136 tok/s is great for embeddings, rerank, and utility models in a stack.</li>
</ul>

<p>These are different jobs. The boxes aren't substitutes; they're complements.</p>

<h2 id="what-id-do-differently">What I'd do differently</h2>

<ol>
<li>
<p><strong>Update everything to the latest first.</strong> I spent a week chasing scores that looked too low only to realize my llama.cpp was 700 commits behind on reasoning-channel handling. Thinking models scored 0% on lm_eval because the reasoning content was consuming the entire context budget. A rebuild fixed it. This stuff moves fast, llama.cpp lands fixes weekly, so pull and rebuild to the latest before you trust a single number.</p>
</li>
<li>
<p><strong>Bench with <code>-d</code> from the start, not <code>-c</code>.</strong> The <code>-c</code> arg got removed from llama-bench in recent builds; the replacement is <code>-d &lt;depth&gt;</code> for testing tg at a given KV depth. My first A30 long-context sweep died at parse time. Trivial fix in retrospect, but it cost me half a day.</p>
</li>
<li>
<p><strong>Don't trust HumanEval+ as a discriminator.</strong> Everything competent scores 85%+. The bench doesn't separate "okay" from "great." Polyglot is what actually matters; I should have run it first.</p>
</li>
<li>
<p><strong>Run <code>whole</code> and <code>diff</code> edit formats both.</strong> I ran everything in <code>whole</code> because it's robust for weak models. That makes the strong-model comparisons against Aider's leaderboard (which uses <code>diff</code>) slightly unfair to the local models. Doing both would have given a cleaner local-vs-API comparison.</p>
</li>
<li>
<p><strong>Treat thermals and bench cleanliness as first-class concerns.</strong> Two specific traps cost me roughly a week of redo work:</p>
<ul>
<li><strong>Don't re-make the same thermal mistakes as last time.</strong> I already worked this box's thermals out in the last post: sustained GPU load trips it unless you cap power with <code>ryzenadj</code> and pin the fans manually, because the stock fan curve is tuned for desktop bursts, not back-to-back benchmarks holding the GPU near 100% for minutes at a time. Then I forgot to actually turn any of that on before kicking off a multi-hour sweep, and crashed the box twice (no kernel log, just unreachable until a power-cycle) rediscovering a lesson I'd already written down. The fix was the one I already had on the shelf: <code>mode=fixed level=5</code> on all three fans (under <code>/sys/class/ec_su_axb35/fan*/</code>) before any sustained workload. The wrapper now refuses to start a bench unless the fans are confirmed above 3500 RPM.</li>
<li><strong>Keep other model servers cleared out the whole time, not just at the start.</strong> Any concurrent <code>llama-server</code> process <code>--mlock</code>'s its model into RAM and steals memory bandwidth from the bench. I caught this when a spot-check tg128 re-run came in 5% higher than the recorded number with everything else stopped. Five percent is small enough to miss in a single run and big enough to materially change rankings across models. The real trap is that it's easy to start clean and then let stray servers creep back in over a long session, so the fix isn't a one-time cleanup, it's re-verifying nothing else is loaded before every single run. Every Strix Halo throughput number in this post was collected that way, and the wrapper enforces it as a precondition.</li>
</ul>
<p>The meta lesson: a bench harness that <em>requires</em> you to remember the discipline will eventually run dirty. Make the harness refuse to run unless the conditions are met.</p>
</li>
</ol>
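
<p>A minimal sketch of what such a gate could look like is below. It is not the actual wrapper: the 3500 RPM threshold and the fan directory come from the text above, but the per-fan file name (<code>rpm</code>) and all function names are assumptions made for illustration.</p>

<pre><code class="language-bash"># Sketch of a bench precondition gate (not the real wrapper).
MIN_RPM=3500                              # threshold named in the post
FAN_GLOB="/sys/class/ec_su_axb35/fan*"    # fan directories named in the post
RPM_FILE="rpm"                            # assumed file name inside each fan dir

# Succeeds only if at least one reading is given and all are above the minimum.
# usage: fans_ok MIN_RPM READING...
fans_ok() {
  min="$1"; shift
  [ "$#" -gt 0 ] || return 1              # no readings counts as a failure
  for rpm in "$@"; do
    [ "$rpm" -gt "$min" ] || return 1
  done
  return 0
}

# Succeeds only if no llama-server process is currently running.
no_other_servers() {
  ! pgrep -x llama-server | grep -q .
}

# Collect fan readings, then refuse to continue unless both checks pass.
preflight() {
  set --                                  # positional params hold the readings
  for f in $FAN_GLOB/$RPM_FILE; do
    [ -r "$f" ] || continue
    set -- "$@" "$(cat "$f")"
  done
  fans_ok "$MIN_RPM" "$@" || { echo "refusing: fans not confirmed above $MIN_RPM RPM"; return 1; }
  no_other_servers || { echo "refusing: a llama-server process is still running"; return 1; }
}

# Intended use, before every single run:  preflight || exit 1
</code></pre>

<p>The point of the design is that the check runs before every bench invocation, not once per session, so a stray server that creeps back in mid-sweep still blocks the next run.</p>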

<h2 id="the-end-result">The end result</h2>

<p>Use the right tool for the job.  Shocking, I know.</p>

<p>The miniPC can be a Sonnet-tier coding agent (when running the right model) that costs about $2,500 once and never sends my code anywhere. The A30 box is for smaller task-specific models that need high throughput.</p>

<p>The local-vs-frontier gap is still real on the hardest problems and on real agentic Polyglot work, but local is roughly Sonnet-class for daily-driver coding tasks, and the gap is closing. The next time someone benchmarks this, I expect the frontier-API moat to be at least a little bit smaller.</p>

<hr />

<a id="footnote-effort"></a>

<p><em>* Footnote: the <code>reasoning_effort</code> parameter not wiring through to llama.cpp's gpt-oss path is documented elsewhere; I verified by running effort=low/medium/high through lm_eval gsm8k and getting near-identical scores (90% / 86% / 86%), within the sampling noise band for a 200-item subset. If the flag were actually doing anything, I'd expect monotonic improvement from low to high; instead "high" is the same as "medium" and "low" comes out <em>higher</em> than both, which only makes sense if all three are effectively the same configuration plus sampling noise. A separate post about this might be coming.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>Running a Frontier Coding Model on an Under-$3K Mini PC</title>
      <link>https://damenknight.com/running-frontier-coding-model-mini-pc/</link>
      <guid isPermaLink="true">https://damenknight.com/running-frontier-coding-model-mini-pc/</guid>
      <pubDate>Thu, 12 Mar 2026 12:00:00 GMT</pubDate>
      <category>Homelab</category>
      <category>Projects</category>
      <description>I got Qwen3-Coder-Next (80B MoE) running at 46 tok/s on an under-$3K mini PC. It took a full OS reinstall, a firmware downgrade, kernel parameter archaeology, a…</description>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong>: I got Qwen3-Coder-Next (80B MoE) running at 46 tok/s on an under-$3K mini PC. It took a full OS reinstall, a firmware downgrade, kernel parameter archaeology, a thermal crisis, and throwing out about half the tuning advice I found online. Here's everything I learned the hard way.</p>
<h2>Why This Hardware</h2>
<p>My existing GPU setups didn't have enough VRAM to run some of the larger models I was interested in testing. Discrete GPUs with 48+ GB of VRAM are absurdly expensive, and splitting a model across multiple consumer cards comes with its own headaches and PCIe bottleneck tax. So I started looking into UMA (Unified Memory Architecture) systems — where the CPU and GPU share the same memory pool — as a significantly more affordable way to get a ton of usable memory for inference.</p>
<p>That led me to the Ryzen AI MAX+ 395. It's a weird chip: a laptop/mini-PC APU with 16 Zen 5 cores (32 threads), a 40-CU RDNA 3.5 iGPU, and support for up to 128 GB of LPDDR5X unified memory. Since the CPU and GPU share the same pool, the GPU can address all 128 GB without PCIe bottlenecks. For LLM inference, where model weights need to stream through the compute units every single token, that's a huge deal.</p>
<p>The theoretical memory bandwidth is 256 GB/s (LPDDR5X-8000 on a 256-bit bus). In practice I measured around 212-215 GB/s, about 83-84% efficiency. That's slower than an M4 Max (~546 GB/s) but faster than trying to cram a 70B model across two consumer GPUs and eating the PCIe tax.</p>
<p>The GMKtec NucBox EVO-X2 packages this chip into a mini PC chassis for under $3K with 128 GB RAM — though with the way LPDDR5 prices have been going lately, check current pricing before you get too excited. There are a few other options with this chip: Framework makes a Desktop, ASUS has the ROG Flow Z13 tablet, and Minisforum has the EliteMini AI Max. The GMKtec was the best price-to-performance option I found at the time, but it's worth shopping around.</p>
<h2>The OS: Rocky Linux 9.7</h2>
<p>I'm running Rocky Linux 9.7 — enterprise stability, good package ecosystem, SELinux actually works properly. Any RHEL 9 derivative should work similarly.</p>
<h2>The Three Things That Must Be Right</h2>
<p>After the base OS was clean, I hit a wall. A really frustrating wall. Getting this hardware working properly requires <strong>three specific things to be correct</strong> — the right kernel, the right firmware, and thermal power limits that won't let the system cook itself to death. I'm going to cover all three here because skipping any one of them will ruin your day.</p>
<h3>1. Kernel 6.18.4 or newer</h3>
<p>The KFD (Kernel Fusion Driver) in older kernels has a page table bug specific to gfx1151. Any GPU tensor allocation triggers "Memory access fault: Page not present" errors. This was fixed upstream in kernel 6.18.4. Rocky 9's stock kernel is 5.14, which is too old.</p>
<p>I tried AMD's <code>amdgpu-dkms</code> package first (which backports the amdgpu driver to older kernels), but the DKMS version is pre-6.18 and doesn't include the KFD fix. No combination of environment variables or kernel parameters (<code>HSA_ENABLE_SDMA=0</code>, <code>amd_iommu=off</code>, <code>amdgpu.noretry=0</code>, <code>amdgpu.cwsr_enable=0</code>) works around it. Trust me, I tried them all. You need the actual kernel fix.</p>
<p>The solution: ELRepo's <code>kernel-ml</code> package, which provides mainline kernels packaged for RHEL/Rocky. I installed 6.19.6 and it just worked.</p>
<pre><code class="language-bash">sudo dnf install -y elrepo-release
sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo dnf --enablerepo=elrepo-kernel install -y kernel-ml
</code></pre>
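<p>After rebooting into the new kernel, it's worth confirming that the running version really is 6.18.4 or later. A minimal sketch, assuming GNU <code>sort -V</code> is available (the <code>version_at_least</code> helper is a made-up name, not part of any package):</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Returns success when version $1 is at least version $2 (uses GNU sort -V).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Strip the package release suffix: 6.19.6-1.el9.elrepo.x86_64 becomes 6.19.6
running="$(uname -r | cut -d- -f1)"
if version_at_least "$running" "6.18.4"; then
  echo "kernel $running is new enough for the KFD fix described above"
else
  echo "kernel $running is older than 6.18.4"
fi
</code></pre>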
<h3>2. MES firmware version 0x80</h3>
<p>Even with kernel 6.19.6, I was still getting page faults. Cool. The second half of the puzzle is the MES (Micro Engine Scheduler) firmware. Rocky's <code>linux-firmware-20260130</code> package ships MES version 0x83, which is known to cause ROCm page faults on Strix Halo. The upstream linux-firmware repository explicitly reverted it with the commit message: "MES FW 0x83 is reported to cause ROCm page faults."</p>
<p>Rocky hadn't picked up the revert yet, and AMD's own <code>amdgpu-dkms-firmware</code> package <em>also</em> ships 0x83. So the fix is manual:</p>
<pre><code class="language-bash"># Download good firmware (version 0x80) from upstream revert commit
curl -sL -o /tmp/gc_11_5_1_mes1.bin \
  "https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/amdgpu/gc_11_5_1_mes1.bin?id=c092c7487eb7c3d58697f490ff605bc38f4cc947"
curl -sL -o /tmp/gc_11_5_1_mes_2.bin \
  "https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/amdgpu/gc_11_5_1_mes_2.bin?id=c092c7487eb7c3d58697f490ff605bc38f4cc947"

# Install to updates dir (takes priority over base firmware)
sudo mkdir -p /lib/firmware/updates/amdgpu   # the directory may not exist yet
sudo cp /tmp/gc_11_5_1_mes1.bin /lib/firmware/updates/amdgpu/
sudo cp /tmp/gc_11_5_1_mes_2.bin /lib/firmware/updates/amdgpu/

# Rebuild initramfs and reboot
sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
sudo reboot
</code></pre>
<p>Verify after reboot:</p>
<pre><code class="language-bash">sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MES
# Good: firmware version: 0x00000080
# Bad:  firmware version: 0x00000083
</code></pre>
<p>Once both pieces were in place, PyTorch passed all validation checks: tensor operations, all data types (fp32, fp16, bf16, int8), 4 GiB memory allocation, and ~1.05 TFLOPS on a 4096x4096 FP32 matmul. Finally.</p>
<p><strong>Lesson learned the hard way:</strong> Pin your firmware. I added <code>exclude=linux-firmware* amdgpu-dkms-firmware*</code> to <code>/etc/dnf/dnf.conf</code> to prevent package updates from sneaking MES 0x83 back in. Ask me how I know.</p>
<h3>3. Thermal Power Limits</h3>
<p><strong>This one might be the most important of the three, so don't skip it.</strong></p>
<p>While setting up a PyTorch benchmarking suite, the system started dying on me. At first I figured "oh weird, the host crashed" — but when I went to check on it, it wasn't just locked up. It was fully powered off. That's... not normal. Then it happened again. And again. Full hard power-off events with no warning, no logs, nothing.</p>
<p>I set up thermal monitoring logging every 5 seconds and caught the cause:</p>
<pre><code>19:00:07  Tctl=71°C   pwr=92W    ← normal inference
19:00:12  Tctl=91°C   pwr=165W   ← torch.compile spike
19:00:22  Tctl=93°C   pwr=164W   ← approaching TjMax (100°C)
19:00:27  Tctl=61°C   pwr=30W    ← thermal shutdown
</code></pre>
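<p>A logger along these lines would produce output like the above. This is a sketch, not the exact script used here, and it rests on two assumptions: that the <code>k10temp</code> hwmon device exposes Tctl as <code>temp1_input</code> (millidegrees), and that the <code>amdgpu</code> hwmon device exposes package power as <code>power1_average</code> (microwatts). On some kernels the power file is named <code>power1_input</code> instead, so check what your system actually has.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Print the hwmon directory whose "name" file matches $1 (searches under $2).
find_hwmon() {
  local want="$1" base="${2:-/sys/class/hwmon}" d
  for d in "$base"/hwmon*; do
    if [ -r "$d/name" ]; then
      if [ "$(cat "$d/name")" = "$want" ]; then echo "$d"; return 0; fi
    fi
  done
  return 1
}

# hwmon reports millidegrees C and microwatts; convert to whole units.
milli_to_unit() { echo $(( $1 / 1000 )); }
micro_to_unit() { echo $(( $1 / 1000000 )); }

# Emit one timestamped line with CPU temperature and GPU-reported power.
log_once() {
  local cpu gpu
  cpu="$(find_hwmon k10temp)" || return 1
  gpu="$(find_hwmon amdgpu)" || return 1
  printf '%s  Tctl=%s°C  pwr=%sW\n' "$(date +%T)" \
    "$(milli_to_unit "$(cat "$cpu/temp1_input")")" \
    "$(micro_to_unit "$(cat "$gpu/power1_average")")"
}

# Uncomment to log every 5 seconds; sync so the last lines survive a hard power-off.
# while true; do log_once | tee -a "$HOME/thermal.log"; sync; sleep 5; done
</code></pre>
<p>The <code>sync</code> matters because a hard power cut can otherwise lose exactly the log lines you most want to see.</p>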
<p><code>torch.compile</code> triggers Triton/Inductor kernel compilation that simultaneously hammers all 16 CPU cores (32 threads) <em>and</em> the GPU. On a UMA APU where everything shares one thermal envelope in a mini PC chassis, that produces a 165W power spike, way past the 120W PPT Fast limit and far more than the little cooler can handle. The firmware thermal protection kicks in and just kills power. No graceful shutdown, just off.</p>
<p>Normal LLM inference is totally fine — 73-75W, 76-80°C, perfectly stable all day long. But the moment you hit a mixed CPU+GPU burst workload, you're rolling the dice. And it's not just <code>torch.compile</code> — anything that pegs the CPU and GPU simultaneously in this chassis can trigger it. I lost count of how many times the system just cut out on me before I got this sorted.</p>
<p>The fix is <strong>ryzenadj</strong>, a tool that lets you adjust AMD mobile power limits from Linux:</p>
<pre><code class="language-bash">sudo ryzenadj --fast-limit=100000 --tctl-temp=88
</code></pre>
<p>This caps burst power to 100W and sets the thermal target to 88°C, giving 12°C of headroom before TjMax. <strong>Do this immediately after your first boot, before you run anything heavy.</strong> I created a systemd service to persist these limits across reboots so they're always active. The GMKtec ships with BIOS 1.12 / EC 1.10 (the latest available), so there's no firmware fix coming — you've gotta manage this in software.</p>
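<p>A unit for persisting the limits could look roughly like the sketch below. The unit name <code>ryzenadj-limits.service</code> and the <code>/usr/local/bin/ryzenadj</code> path are placeholders, not necessarily what is installed on this box; only the two ryzenadj flags come from the command above.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Build the unit text in a function so it can be inspected before installing.
ryzenadj_unit() {
  printf '%s\n' \
    '[Unit]' \
    'Description=Apply ryzenadj power and thermal limits' \
    '' \
    '[Service]' \
    'Type=oneshot' \
    'ExecStart=/usr/local/bin/ryzenadj --fast-limit=100000 --tctl-temp=88' \
    '' \
    '[Install]' \
    'WantedBy=multi-user.target'
}

# To install (unit name and binary path are placeholders):
# ryzenadj_unit | sudo tee /etc/systemd/system/ryzenadj-limits.service
# sudo systemctl daemon-reload
# sudo systemctl enable --now ryzenadj-limits.service
</code></pre>
<p>Running <code>ryzenadj_unit</code> on its own prints the unit so you can review it before anything is written to <code>/etc</code>.</p>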
<p>Other thermal improvements people recommend but I haven't tried yet: replacing the stock thermal paste with PTM7950 phase-change material, and the <code>ec_su_axb35</code> kernel module for Linux fan control. Maybe I'll get to those at some point.</p>
<h2>Understanding Unified Memory (It's Unintuitive)</h2>
<p>The BIOS has a "UMA Frame Buffer Size" setting that defaults to 64 GB. Your instinct says "big number = more GPU memory = good." Yeah, your instinct is wrong here.</p>
<p>On a traditional discrete GPU, VRAM is physically separate from system RAM. On Strix Halo, there's only one pool of LPDDR5. The BIOS carveout <em>reserves</em> a chunk of that pool as dedicated VRAM — the OS can't see it, can't use it for anything else, and the GPU doesn't even need it because it can access system RAM at the same speed through GTT (Graphics Translation Table).</p>
<p>The optimal configuration is:</p>
<ul>
<li><strong>BIOS VRAM: 2 GB</strong> (the minimum on the GMKtec's current BIOS 1.12 — you'll see guides online saying to set this to 512 MB, but that was only possible on earlier BIOS versions. 2 GB is as low as it goes now.)</li>
<li><strong>GTT: 124 GB</strong> (dynamically mapped, shared between CPU and GPU)</li>
</ul>
<p>This gives you ~124 GB usable for both CPU and GPU workloads, instead of 64 GB locked to GPU + 64 GB for CPU.</p>
<p>The kernel parameters to make this work:</p>
<pre><code>amdgpu.gttsize=126976          # 124 GiB GTT
ttm.pages_limit=29360128       # Allow TTM to manage 112 GiB of pages
ttm.page_pool_size=29360128    # Matching pool size
amdgpu.no_system_mem_limit=1   # Disable SVM resident memory cap
amd_iommu=off                  # Fully disable IOMMU (~4% bandwidth gain)
</code></pre>
<p>The <code>ttm.pages_limit</code> parameter is particularly sneaky. Without it, you can set GTT to 124 GB and the kernel will report 124 GB, but HIP/ROCm applications will only see ~62 GiB. The TTM subsystem has its own page limit that must match. And it has to be set at boot — runtime changes don't take effect. That one took a while to figure out.</p>
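<p>The units are easy to mix up: <code>amdgpu.gttsize</code> is in MiB, while the TTM limits count 4 KiB pages. A small conversion sketch (helper names are made up) shows that the page count listed above works out to 112 GiB, and what the page count would be for the full 124 GiB if you want the two to line up exactly:</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# ttm.pages_limit counts 4 KiB pages; amdgpu.gttsize is in MiB.
gib_to_pages() { echo $(( $1 * 1024 * 1024 / 4 )); }   # GiB to 4 KiB pages
pages_to_gib() { echo $(( $1 * 4 / 1024 / 1024 )); }   # 4 KiB pages to GiB
gib_to_mib()   { echo $(( $1 * 1024 )); }              # GiB to MiB

pages_to_gib 29360128   # the ttm.pages_limit value listed above
gib_to_mib 124          # the amdgpu.gttsize value listed above
gib_to_pages 124        # page count that would cover a full 124 GiB
</code></pre>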
<p>On Rocky 9, updating kernel parameters has its own gotcha: editing <code>/etc/default/grub</code> and running <code>grub2-mkconfig</code> <strong>doesn't work</strong>. Rocky 9 uses BLS (Boot Loader Specification) entries, which have their own options line. Use <code>grubby</code> instead:</p>
<pre><code class="language-bash">sudo grubby --update-kernel=DEFAULT --args="amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=29360128 ttm.page_pool_size=29360128 amdgpu.no_system_mem_limit=1"
</code></pre>
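<p>Since a silently ignored boot parameter is exactly the kind of thing that wastes an afternoon, it helps to compare <code>/proc/cmdline</code> against the list after rebooting. A minimal sketch (the <code>missing_args</code> helper is a made-up name):</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Print each expected parameter that is absent from a kernel command line string.
missing_args() {
  local cmdline="$1" arg
  shift
  for arg in "$@"; do
    case " $cmdline " in
      *" $arg "*) ;;          # present as a whole word: nothing to report
      *) echo "$arg" ;;
    esac
  done
}

expected=(amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=29360128
          ttm.page_pool_size=29360128 amdgpu.no_system_mem_limit=1)

# Empty output means every parameter made it onto the running kernel.
missing_args "$(cat /proc/cmdline)" "${expected[@]}"
</code></pre>
<p>Before rebooting, <code>sudo grubby --info=DEFAULT</code> shows the args line that the default boot entry will use.</p>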
<h2>Building and Running llama.cpp</h2>
<p>Ok, with the hardware finally cooperating, I built llama.cpp. I started with ROCm/HIP since that's what everyone recommends for AMD GPUs:</p>
<pre><code class="language-bash">cmake -B build \
  -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_NATIVE=OFF -DCMAKE_C_FLAGS='-march=znver4' -DCMAKE_CXX_FLAGS='-march=znver4' \
  -DGGML_AVX512=ON -DGGML_AVX512_VBMI=ON -DGGML_AVX512_VNNI=ON -DGGML_AVX512_BF16=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_LTO=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
</code></pre>
<p>A few build notes:</p>
<ul>
<li><code>-DGGML_NATIVE=OFF</code> with explicit <code>-march=znver4</code> is required because GCC 11 on Rocky 9 emits VNNI instructions that the system's binutils can't assemble. Specifying znver4 explicitly avoids the problematic auto-detection.</li>
<li>The AVX512 flags enable SIMD for CPU-side tensor ops. Zen 5 has full AVX-512 support.</li>
<li><code>GGML_HIP_ROCWMMA_FATTN</code> builds the flash attention kernels on AMD's rocWMMA library (wave-level matrix multiply-accumulate).</li>
</ul>
<p><strong>Critical for APUs:</strong> You must set <code>GGML_CUDA_ENABLE_UNIFIED_MEMORY=1</code> before running. Without it, llama.cpp tries to allocate in the 2 GB dedicated VRAM carveout and fails for any model larger than 2 GB. With it, allocations go through the full GTT pool. Don't skip this or you'll be very confused.</p>
<h3>The Model</h3>
<p>I'm running <strong>Qwen3-Coder-Next Q4_K_M</strong> — an 80B parameter Mixture-of-Experts model with 3B active parameters, purpose-built for coding agents. At Q4_K_M quantization it's about 46 GiB across 4 GGUF shards, fitting comfortably in 128 GB with room for a 65K token context window.</p>
<p>The Mixture-of-Experts architecture is what makes this hardware viable. An 80B MoE model only needs to stream the active expert weights each token — roughly 3B parameters — not the full 80B. Dense 70B models? They crawl at 5-7 tok/s on this hardware. This 80B MoE? 46 tok/s. Same memory, same bandwidth — the model architecture makes all the difference.</p>
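<p>A back-of-envelope bound makes the difference concrete. If every generated token has to stream the active weights through memory once, then tokens per second can't exceed bandwidth divided by the bytes read per token. The sketch below derives bits per weight from the 46 GiB / 80B figures above and assumes the same quantization for a hypothetical dense 70B model; it ignores KV cache reads, attention, and routing overhead, so it's a ceiling rather than a prediction.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Upper bound on decode speed when each token streams the active weights once:
#   tok/s = memory bandwidth / bytes of weights read per token

# $1 = file size in GiB, $2 = total parameters in billions
bits_per_weight() {
  awk -v g="$1" -v p="$2" 'BEGIN { printf "%.2f\n", g * 1073741824 * 8 / (p * 1e9) }'
}

# $1 = bandwidth in GB/s, $2 = active parameters in billions, $3 = bits per weight
tok_per_sec_bound() {
  awk -v bw="$1" -v p="$2" -v b="$3" 'BEGIN { printf "%.1f\n", bw / (p * b / 8) }'
}

bpw="$(bits_per_weight 46 80)"    # from the 46 GiB, 80B figures above
tok_per_sec_bound 215 3  "$bpw"   # MoE: roughly 3B active parameters per token
tok_per_sec_bound 215 70 "$bpw"   # dense 70B at the same bits per weight
</code></pre>
<p>The dense ceiling comes out at about 5 tok/s, which lines up with the 5-7 tok/s range, while the MoE ceiling is well above the 46 tok/s actually measured, as you'd expect once the ignored overheads are counted.</p>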
<p>This model scored #1 on SWE-rebench Pass@5 at 64.6%, beating Claude Opus 4.6 (58.3%). Running it locally at interactive speeds on a sub-$3K box (give or take, depending on what RAM prices are doing this week) is... pretty nuts.</p>
<h3>Runtime Configuration</h3>
<p>I run llama-server as a systemd service with these flags:</p>
<pre><code>-fa on              # Flash attention (faster attention; needed for the quantized V cache)
--parallel 1        # Single slot: all memory for one user
-t 32 -tb 32        # All 32 hardware threads (16 cores)
-ub 2048            # Large ubatch for GPU utilization during prompt processing
-ctk q8_0 -ctv q8_0  # Quantized KV cache (~2x smaller than f16, minimal quality loss)
--mlock             # Pin model in RAM
-c 65536            # 65K context window
</code></pre>
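<p>Put together, the launch command looks something like the sketch below. The model path is a placeholder, and the binary location assumes llama-server was copied to <code>/usr/local/bin</code>; the flags are the ones listed above.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Flags from the list above, kept in an array so each one stays a separate argument.
llama_flags=(-fa on --parallel 1 -t 32 -tb 32 -ub 2048
             -ctk q8_0 -ctv q8_0 --mlock -c 65536)

# Placeholder model path: point it at the first GGUF shard on your system.
model="/path/to/model.gguf"

# Print the full command for review; drop the echo to actually launch the server.
echo /usr/local/bin/llama-server -m "$model" "${llama_flags[@]}"
</code></pre>
<p>In a systemd unit the same arguments go on the <code>ExecStart=</code> line.</p>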
<p>One thing I learned about GPU power modes: <code>profile_peak</code> sounds good but actually causes thermal throttling on an integrated GPU sharing the SoC thermal envelope. Generation dropped from 37.9 to 26.9 tok/s. Ouch. Use <code>high</code> instead: it clocks up aggressively but lets the thermal controller do its job.</p>

<h2>Tuning: What the Internet Got Wrong</h2>
<p>With the system stable, I went through every tuning recommendation I could find — a comprehensive "definitive guide" document and the <a href="https://strixhalo.wiki/AI/llamacpp-performance">strixhalo.wiki llama.cpp performance page</a>. I benchmarked each claim individually. A lot of them were wrong, at least for this hardware.</p>
<h3>Things that didn't matter</h3>
<p><strong><code>--no-mmap</code> vs <code>--mlock</code></strong>: Identical performance. pp=219.5/tg=37.7 vs pp=218.7/tg=38.0. On a UMA APU where GPU memory <em>is</em> system memory, both approaches effectively do the same thing. Pick whichever you prefer.</p>
<p><strong><code>-b 256</code> batch size</strong>: Slightly <em>worse</em> than the default <code>-ub 2048</code>. The claimed jump from 70 to 591 tok/s was for Qwen3-30B-A3B, a much smaller model with different memory access patterns. Don't copy batch size settings across models.</p>
<p><strong><code>ROCBLAS_USE_HIPBLASLT=1</code></strong>: No measurable effect on gfx1151 with this model. The "mandatory" claim may apply to other GPU architectures.</p>
<h3>Things that helped a little</h3>
<p><strong><code>amd_iommu=off</code></strong>: Real. Generation speed went from 38.0 to 39.4 tok/s — a 3.7% improvement. Not the claimed 6%, but free performance. I also bumped GTT from 112 GiB to 124 GiB in the same change.</p>
<h3>The big discovery: Vulkan beats ROCm</h3>
<p>Then I built llama.cpp with Vulkan instead of HIP, just to see what would happen:</p>
<pre><code class="language-bash">cmake -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-vulkan -j$(nproc)
</code></pre>
<p>The results were... not subtle:</p>
<table>
<thead>
<tr>
<th>Context</th>
<th>Vulkan pp (tok/s)</th>
<th>Vulkan tg (tok/s)</th>
<th>HIP pp (tok/s)</th>
<th>HIP tg (tok/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Default (512)</td>
<td>548</td>
<td>45.9</td>
<td>336</td>
<td>40.8</td>
</tr>
<tr>
<td>32K</td>
<td>394</td>
<td>36.8</td>
<td>91</td>
<td>29.7</td>
</tr>
<tr>
<td>65K</td>
<td>305</td>
<td>32.2</td>
<td>54</td>
<td>23.5</td>
</tr>
<tr>
<td>100K</td>
<td>213</td>
<td>28.2</td>
<td>36</td>
<td>18.7</td>
</tr>
</tbody>
</table>
<p>Vulkan with RADV (Mesa's open-source Vulkan driver) was 63% faster at prompt processing and 12% faster at generation at default context. The gap <em>widens</em> with context length — at 100K tokens, Vulkan is nearly 6x faster at prompt processing and 51% faster at generation.</p>
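<p>Those percentages fall straight out of the table. A short sketch for reproducing them (helper names are made up; the inputs are the table values):</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Percentage by which a is faster than b.
pct_faster() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'; }
# Plain speed ratio a / b.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'; }

pct_faster 548 336     # prompt processing, default context
pct_faster 45.9 40.8   # generation, default context
ratio 213 36           # prompt processing, 100K context
pct_faster 28.2 18.7   # generation, 100K context
</code></pre>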
<p>This directly contradicts the common advice that "ROCm is better for long-context work." That may be true on datacenter GPUs (MI300X) or older desktop GPUs (gfx1100), but on gfx1151, the HIP compute kernels are known to run 2-6x slower than expected. Vulkan's cooperative matrix support through RADV doesn't have the same problem.</p>
<p>The guides also recommended AMDVLK (AMD's own open-source Vulkan driver) over RADV for 10-15% better performance. I investigated and found that AMD <strong>discontinued AMDVLK in September 2025</strong>, going all-in on RADV. The strixhalo.wiki's own benchmarks actually show RADV beating AMDVLK even before they killed it. Just use RADV.</p>
<p>One nice bonus: the Vulkan build doesn't need the <code>GGML_CUDA_ENABLE_UNIFIED_MEMORY=1</code> environment variable. That's a HIP/ROCm-specific workaround.</p>
<h2>The Boring But Important Stuff</h2>
<p>A handful of other things that aren't exciting but tripped me up:</p>
<p><strong>DNF firmware pinning</strong>: Added <code>exclude=linux-firmware* amdgpu-dkms-firmware*</code> to <code>/etc/dnf/dnf.conf</code>. Without this, a routine <code>dnf update</code> can reintroduce MES 0x83 and break GPU compute.</p>
<p><strong>EPEL rocminfo conflict</strong>: EPEL ships rocminfo 5.4.4 which conflicts with the ROCm 7.2 version from AMD's repo. Fixed with <code>dnf config-manager --save --setopt=epel.excludepkgs=rocminfo</code>.</p>
<p><strong>SELinux and systemd</strong>: The llama-server binary must live in <code>/usr/local/bin</code> (not <code>~/</code>) for SELinux to allow systemd to execute it. Run <code>restorecon -v</code> after copying.</p>
<p><strong>WiFi</strong>: The MediaTek MT7925 (Wi-Fi 7) works with WPA2 networks but fails on WPA2/WPA3 mixed-mode SSIDs. Suspected <code>mt7925e</code> driver bug. If your router broadcasts both, you may need a WPA2-only SSID.</p>
<p><strong>GPU performance mode</strong>: Set via udev rule to persist across reboots:</p>
<pre><code class="language-bash">echo 'ACTION=="add", SUBSYSTEM=="drm", KERNEL=="card0", ATTR{device/power_dpm_force_performance_level}="high"' \
  | sudo tee /etc/udev/rules.d/99-gpu-perf.rules
</code></pre>
<h2>What I'd Do Differently</h2>
<p>If I were setting this up again from scratch:</p>
<ol>
<li>
<p><strong>Start with Vulkan, not ROCm/HIP.</strong> I spent way too much time optimizing the HIP build before discovering Vulkan was faster at everything. Just build llama.cpp with <code>-DGGML_VULKAN=ON</code> from the start.</p>
</li>
<li>
<p><strong>Install ELRepo kernel immediately.</strong> Don't waste time trying to make the stock 5.14 kernel work with DKMS. It can't. I tried.</p>
</li>
<li>
<p><strong>Check MES firmware before debugging anything else.</strong> If <code>rocminfo</code> hangs or GPU compute produces page faults, check MES version first. It's the most common cause and the least obvious one.</p>
</li>
<li>
<p><strong>Set BIOS VRAM to minimum and maximize GTT from day one.</strong> The default 64 GB carveout wastes half your memory for no reason.</p>
</li>
<li>
<p><strong>Install ryzenadj before you do literally anything else.</strong> Seriously. The thermal shutdowns caught me completely off guard and happened repeatedly. The stock power limits on this chassis are not safe for sustained workloads. Cap power <em>first</em>, then start playing with models.</p>
</li>
</ol>
<h2>The End Result</h2>
<p>My final configuration:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Setting</th>
</tr>
</thead>
<tbody>
<tr>
<td>OS</td>
<td>Rocky Linux 9.7, kernel 6.19.6 (ELRepo)</td>
</tr>
<tr>
<td>GPU driver</td>
<td>Mesa RADV 25.0.7 (Vulkan)</td>
</tr>
<tr>
<td>MES firmware</td>
<td>0x80 (manually installed)</td>
</tr>
<tr>
<td>BIOS VRAM</td>
<td>2 GB (minimum)</td>
</tr>
<tr>
<td>GTT</td>
<td>124 GiB</td>
</tr>
<tr>
<td>IOMMU</td>
<td>Fully disabled</td>
</tr>
<tr>
<td>Power limits</td>
<td>100W burst / 88°C target (ryzenadj)</td>
</tr>
<tr>
<td>llama.cpp</td>
<td>Vulkan build, flash attention, q8_0 KV cache</td>
</tr>
<tr>
<td>Model</td>
<td>Qwen3-Coder-Next Q4_K_M (80B MoE, 46 GiB)</td>
</tr>
<tr>
<td>Context</td>
<td>65K tokens</td>
</tr>
</tbody>
</table>
<p>Performance:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Speed</th>
</tr>
</thead>
<tbody>
<tr>
<td>Token generation (short context)</td>
<td>45.9 tok/s</td>
</tr>
<tr>
<td>Token generation (32K context)</td>
<td>36.8 tok/s</td>
</tr>
<tr>
<td>Token generation (65K context)</td>
<td>32.2 tok/s</td>
</tr>
<tr>
<td>Token generation (100K context)</td>
<td>28.2 tok/s</td>
</tr>
<tr>
<td>Prompt processing (short context)</td>
<td>548 tok/s</td>
</tr>
</tbody>
</table>
<p>For a mini PC that cost me under $3K — though good luck getting that price if LPDDR5 keeps doing what it's been doing — running a frontier-class 80B coding model entirely locally, with 65K context and no API costs? I'm pretty happy with that.</p>
<hr />
<p><em>Tested on: GMKtec NucBox EVO-X2, AMD Ryzen AI MAX+ 395, 128 GB LPDDR5, Rocky Linux 9.7, kernel 6.19.6, llama.cpp build f90bd1dd8, Mesa RADV 25.0.7. March 2026.</em></p>]]></content:encoded>
    </item>
    <item>
      <title>Rescuing &quot;Unsupported&quot; Enterprise SSDs with Custom MegaRAID Tools</title>
      <link>https://damenknight.com/rescuing-unsupported-enterprise-ssds-megaraid-tools/</link>
      <guid isPermaLink="true">https://damenknight.com/rescuing-unsupported-enterprise-ssds-megaraid-tools/</guid>
      <pubDate>Sat, 31 Jan 2026 09:00:00 GMT</pubDate>
      <category>Homelab</category>
      <description>I picked up a pair of refurbished Samsung PM1643a SSDs for my Dell R740 Proxmox server, but my PERC H330 showed them as &quot;Unsupported&quot; with 0 KB size. They had…</description>
      <content:encoded><![CDATA[<p>I picked up a pair of refurbished Samsung PM1643a SSDs (3.84TB each) for my Dell R740 Proxmox server. Great deal on enterprise drives, right? Except when I installed them, my PERC H330 controller showed them as "UGUnsp" (Unconfigured Good Unsupported) with a size of... 0 KB.</p>

<p>The drives had been pulled from an enterprise storage array (likely Hitachi or EMC) and were formatted with 520-byte sectors instead of the standard 512. Those extra 8 bytes per sector are used for T10-DIF data integrity protection – useful in big SANs, useless for my homelab.</p>

<p>The usual fixes wouldn't work. <code>sg_format</code>? Can't see the drives. Samsung DC Toolkit? Nope. Perccli format/erase commands? "Operation not allowed." The controller refused to expose them to Linux at all – no <code>/dev/sd*</code>, no <code>/dev/sg*</code>. My only option seemed to be flashing the H330 to IT-mode, which meant unacceptable downtime.</p>

<p>Then I noticed something: <code>smartctl -d megaraid,4 -i /dev/sda</code> could actually talk to the drives via MegaRAID passthrough. The controller wouldn't expose them, but it would relay SCSI commands to them. That was my way in.</p>

<p>With some help from Claude Code, I dug into smartctl's source code and reverse-engineered the MegaRAID IOCTL interface. The result is a set of small C tools that send SCSI FORMAT UNIT and MODE SELECT commands directly through the MegaRAID passthrough – no HBA flash required, no downtime.</p>

<p>Both drives are now happily running at 512-byte sectors, showing their full 3.49 TiB each, and working perfectly as JBOD in Proxmox.</p>

<p>I've open-sourced the tools in case anyone else runs into this: <a href="https://github.com/filthyrake/megaraid_format_tools">github.com/filthyrake/megaraid_format_tools</a></p>]]></content:encoded>
    </item>
    <item>
      <title>Vlog??</title>
      <link>https://damenknight.com/vlog/</link>
      <guid isPermaLink="true">https://damenknight.com/vlog/</guid>
      <pubDate>Wed, 03 Dec 2025 21:07:35 GMT</pubDate>
      <category>Uncategorized</category>
      <description>Once upon a time I had a YouTube channel. I mean I still do but most of my videos are now private and I stopped posting. I don’t really want to reactivate my…</description>
      <content:encoded><![CDATA[
<p>Once upon a time I had a YouTube channel.  I mean, I still do, but most of my videos are now private and I stopped posting.  I don’t really want to reactivate my channel and be a YouTuber again, but I also don’t want to just toss all that stuff out, so I’m standing up <a href="https://vlog.damenknight.com">vlog.damenknight.com</a> and migrating MOST of my old YouTube content over.</p>



<p>Now I DO want to create at least some content still – I clearly enjoy it – but doing it here instead of on YT will hopefully let me keep it a bit more chill and maybe more consistent.  Head on over, check it out as I get things migrated, and keep an eye out in the future for all NEW content!  More car stuff, more astro stuff, more tech, you name it!</p>
]]></content:encoded>
    </item>
    <item>
      <title>Astrophotography Datasets Site</title>
      <link>https://damenknight.com/astrophotography-datasets-site/</link>
      <guid isPermaLink="true">https://damenknight.com/astrophotography-datasets-site/</guid>
      <pubDate>Mon, 05 May 2025 00:13:28 GMT</pubDate>
      <category>Astrophotography</category>
      <description>For a while now, I’ve been sharing my astrophotography datasets on a really basic mid-90’s-looking site I threw together. It was ugly, it was hard to use, it…</description>
      <content:encoded><![CDATA[
<p>For a while now, I’ve been sharing my astrophotography datasets on a really basic mid-90’s-looking site I threw together.  It was ugly, it was hard to use, it sucked.  But it served its purpose and I was ok with it.  I had bigger plans though.</p>



<p>You see, there aren’t a ton of sites that share astrophotography datasets – especially not for free.  Many people sell theirs, and the best free options have historically been places like NASA and the ESA.  I absolutely love those free resources, but I wanted something for the rest of us and with more variety.</p>



<p>So I’ve spent the past… many many many months working hard to build something new.  This was a pain, since I am not a web guy or a software guy and this involved both – and I’m far far from done still – but it is in a place where I’m happy to talk about it and start showing it off.</p>



<p>If you haven’t yet, head over to check it out: <a href="https://datasets.miscellaneousnerdery.com">Miscellaneous Datasets</a>.  It is fairly filterable by what kind of data you want.  All datasets can be downloaded in all the major formats.  Right now it is limited to data I’ve personally captured, but I’m hard at work getting more contributors on board.  My dream is that someday this will be the largest free non-government dataset resource in the world.</p>



]]></content:encoded>
    </item>
    <item>
      <title>Warewulf Home Lab Setup</title>
      <link>https://damenknight.com/warewulf-home-lab-setup/</link>
      <guid isPermaLink="true">https://damenknight.com/warewulf-home-lab-setup/</guid>
      <pubDate>Tue, 25 Mar 2025 19:13:18 GMT</pubDate>
      <category>Homelab</category>
      <category>Projects</category>
      <description>This cluster is something I set up to learn more about HPC. My initial project is using the cluster to do astrophotography image…</description>
      <content:encoded><![CDATA[
<figure><img decoding="async" width="1920" height="1440" src="https://damenknight.com/images/image-2-1920x1440.jpg" alt=""   /></figure>



<figure><img decoding="async" width="846" height="545" src="https://damenknight.com/images/image-1.png" alt=""   /></figure>



<details><summary>Control Node</summary>
<p>VM running on PowerEdge R730<br>Rocky Linux 9.5<br>Warewulf 6.4</p>



</details>



<details><summary>Compute Nodes</summary>
<p>1x MeLE Quieter 3C<br>* Intel N100<br>* 16GB RAM<br><br>3x MeLE Quieter 4C<br>* Intel N100<br>* 16GB RAM</p>
</details>



<details><summary>Connectivity</summary>
<p>TrendNet Switch</p>
</details>



<p>This cluster is something I set up to learn more about HPC.  My initial project is using the cluster to do astrophotography image pre-processing/calibrating/integrating.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Obligatory HomeLab Writeup</title>
      <link>https://damenknight.com/obligatory-homelab-writeup/</link>
      <guid isPermaLink="true">https://damenknight.com/obligatory-homelab-writeup/</guid>
      <pubDate>Sat, 15 Mar 2025 16:41:12 GMT</pubDate>
      <category>Homelab</category>
      <category>Projects</category>
      <description>Don’t judge my local fire-hazard. I’ll get this out of the way first: You do not want rack servers in your home. They’re *really* loud. I am just a crazy…</description>
      <content:encoded><![CDATA[
<figure><img decoding="async" width="810" height="1080" src="https://damenknight.com/images/image-810x1080.jpg" alt=""   /></figure>



<p class="has-text-align-center has-small-font-size">Don’t judge my local fire-hazard. </p>



<p>I’ll get this out of the way first: You do not want rack servers in your home.  They’re *<strong>really</strong>* loud.  I am just a crazy person.</p>






<p>Ok, now that that’s out of the way, let’s talk about what we’ve got.  Starting with the items actually installed in the rack from the bottom and working our way up:</p>



<details><summary>Synology RS2421+</summary>
<p><strong>Hardware:</strong><br>12x 18TB Seagate Exos HDDs (SHR)<br>2x 1TB SSD Cache (RAID 1)<br>32GB RAM<br>10Gbe NIC<br><br><strong>Services:</strong><br>DHCP<br>Temporary Web Hosting<br>NAS</p>



</details>



<details><summary>Dell PowerEdge R730</summary>
<p><strong>Hardware:</strong><br>2x 1TB SAS SSDs (RAID 1)<br>6x 1TB SAS SSDs (RAID 10)<br>768GB RAM<br>2x Xeon E5-2699(v4)<br>2x 10Gbe NIC<br>2x 1Gbe NIC<br>1x nVidia Tesla P4<br>1x nVidia Tesla T4<br><br><strong>Services:</strong><br>Proxmox<br>Ollama<br>Prod Web Hosting<br>Media Management<br>Frigate<br>WareWulf Cluster Management</p>
</details>



<details><summary>Dell PowerEdge R730xd</summary>
<p><strong>Hardware:</strong><br>2x 1TB SATA SSDs (ZFS Mirror)<br>12x 16TB Seagate Exos SATA HDDs (ZFS Striped Mirror)<br>768GB RAM<br>2x Xeon E5-2680(v3)<br>2x 10Gbe NIC<br>2x 1Gbe NIC<br>1x LSI SAS9305-16e HBA (attached to the PowerVault MD1400)<br><br><strong>Services:</strong><br>TrueNAS<br>NAS<br>Immich<br>MinIO<br>Tailscale<br>Uptime-Kuma<br>Postgres</p>



</details>



<details><summary>Dell PowerVault MD1400</summary>
<p><strong>Hardware</strong>:<br>6x 18TB Seagate Exos SAS HDDs (ZFS Striped Mirror)</p>
</details>



<details><summary>Juniper EX4550-32T-AFI</summary>
<p>10Gbe Managed Switch</p>
</details>



<details><summary>CyberPower CPS1215RM </summary>
<p>It’s a PDU, what do you want?</p>
</details>



<p>On top of the rack, things get a little spicier.</p>



<figure><img decoding="async" width="1920" height="1440" src="https://damenknight.com/images/image-1-1920x1440.jpg" alt=""   /></figure>



<details><summary>WareWulf Cluster</summary>
<p><strong>Hardware</strong>:<br>1x MeLE Quieter 3C MiniPC<br>3x MeLE Quieter 4C MiniPC<br>1x TRENDnet Switch (connects to 1Gbe NIC in the PowerEdge R730 for the WareWulf Controller)<br><br><strong>Services:</strong><br>Test environment for HPC<br><br><a href="https://damenknight.com/warewulf-home-lab-setup/">Details here.</a></p>
</details>



<details><summary>Plex Server</summary>
<p><strong>Hardware:</strong><br>BeeLink EQ12 MiniPC<br><br><strong>Services:</strong><br>Plex</p>
</details>



<details><summary>Home Assistant</summary>
<p><strong>Hardware:</strong><br>1x Home Assistant Green<br><br><strong>Services:</strong><br>Home Assistant</p>
</details>



<p>Moving beyond the rack entirely, we have the router/firewall.</p>



<details><summary>Custom SFF PC</summary>
<p><strong>Hardware:</strong><br>1x Core i5-12400<br>64GB RAM<br>1x 2.5Gbe NIC (to internet)<br>1x 10Gbe NIC (to LAN)<br><br><strong>Services:</strong><br>pfSense<br>DNS<br>pfBlockerNG<br></p>
</details>



<p>Finally we have the ridiculous power distribution/UPS setup.  I won’t go into too much detail here, but there are 2x 1500VA APC UPSes, and everything with redundant power is split between the two.  Those two UPSes are themselves plugged into a Jackery 5000 power station.</p>



<p>The rack itself is just a VEVOR 12U open frame rack.</p>



<p>I left out a few things – I’ve got a Raspberry Pi 5 running an allsky camera in my backyard (<a href="https://allsky.miscellaneousnerdery.com/">https://allsky.miscellaneousnerdery.com/</a>), there’s a Sonos node, a SmartThings node, a Philips node, a Govee node, etc etc etc.  I may go into detail on those at some point, if I ever build out a full network map to share – it has gotten a lot more complicated since my last post.</p>



<p>In my next post I’ll go into the various services I’m running in more detail, along with how they’re all set up and what I’m using them for.</p>



]]></content:encoded>
    </item>
    <item>
      <title>Mounting my miniPC and Powerbox</title>
      <link>https://damenknight.com/mounting-my-minipc-and-powerbox/</link>
      <guid isPermaLink="true">https://damenknight.com/mounting-my-minipc-and-powerbox/</guid>
      <pubDate>Mon, 29 Jan 2024 14:43:16 GMT</pubDate>
      <category>Astrophotography</category>
      <description>I recently got a Pegasus Pocket PowerBox Advanced for my astro setup, and while I was at it I moved my miniPC up to attach on the telescope instead of onto my…</description>
      <content:encoded><![CDATA[
<p>I recently got a Pegasus Pocket PowerBox Advanced for my astro setup, and while I was at it I moved my miniPC up to attach on the telescope instead of onto my mount.  </p>



<p>I did this using the accessories from BuckEye Stargazer.  Unfortunately, there are not a ton of guides or instructions available for how everything goes together so I went ahead and filmed a quick video going over the process.</p>



<figure><div>
<iframe loading="lazy" title="Mounting the Pegasus Astro Pocket Powerbox Advance (and miniPC) with the Buckeye Stargazer adapters" width="500" height="281" src="https://www.youtube.com/embed/wqQ-__FTluE?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>
]]></content:encoded>
    </item>
    <item>
      <title>Achieving better telescope balance</title>
      <link>https://damenknight.com/achieving-better-telescope-balance/</link>
      <guid isPermaLink="true">https://damenknight.com/achieving-better-telescope-balance/</guid>
      <pubDate>Thu, 25 Jan 2024 17:25:34 GMT</pubDate>
      <category>Astrophotography</category>
      <description>On the astrophotography Discord server I hang out on I’ve seen lots and lots of folks struggle with balancing their telescope. I think a big part of the reason…</description>
      <content:encoded><![CDATA[
<figure><div>
<iframe loading="lazy" title="Your telescope is out of balance and you dont even know it!" width="500" height="281" src="https://www.youtube.com/embed/dFjvplKKrKI?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>On the astrophotography Discord server I hang out on I’ve seen lots and lots of folks struggle with balancing their telescope.  I think a big part of the reason why this is so difficult is that a lot of the descriptions of what good balance is are vague or non-existent.  So I went ahead and made a quick video to help show what “good” balance looks like.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Network Setup</title>
      <link>https://damenknight.com/network-setup/</link>
      <guid isPermaLink="true">https://damenknight.com/network-setup/</guid>
      <pubDate>Mon, 22 Jan 2024 18:39:44 GMT</pubDate>
      <category>Homelab</category>
      <category>Projects</category>
      <description>WiFi: NetGear Orbi RBRE960 (AP mode, 3AP Mesh) Router: Custom MiniPC pfSense Router (10GBit LAN, 2.5GBit WAN)</description>
      <content:encoded><![CDATA[
<figure><img decoding="async" width="844" height="934" src="https://damenknight.com/images/network-diagram.png" alt=""   /></figure>



<p><strong>WiFi:</strong> NetGear Orbi RBRE960 (AP mode, 3AP Mesh)<br><strong>Router:</strong> Custom MiniPC pfSense Router (10GBit LAN, 2.5GBit WAN)<br></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
