TL;DR: I found a bug in llama.cpp. It’s a one-character cross-backend bug in ggml: the step() operator returns 1 at exactly zero on the Vulkan (GPU) backend (x >= 0) but 0 on the CPU (x > 0). It’s filed (#25027), a maintainer wrote the fix (PR #25036), and it’s approved and waiting to merge.

I found it the hard way: building sparse-attention prefill on my “cheap” AMD Strix Halo miniPC, I hit a slowdown where there should have been a 2x speedup, and spent the better part of a week blaming the graph scheduler, the allocator, an O(n^2) re-sweep, and buffer aliasing, before realizing my “drop this block” mask, built out of step(), was quietly keeping every block because step(0) was wrong on the GPU, so FlashAttention read all of it anyway.

Just enough setup to make the bug land

To explain what the bug was I have to back up to what I was building…

I’ve been trying to make a long context usable on a “cheap” Strix Halo box (a Ryzen AI Max+ 395 with the gfx1151 iGPU) - yes I know, I am a crazy person. That box is bandwidth-bound so the only real approach is moving fewer bytes, and during a long prefill the way you do that is sparse attention: don’t read the parts of the KV cache that don’t matter.

The neat thing about doing this in llama.cpp is that you don’t have to write a kernel. The Vulkan FlashAttention already has a tile-skip (PR #19281): before it loads a tile of K and V, it checks whether that tile’s slice of the attention mask is entirely -inf, and if so it skips the tile, the load, and the compute. The entire feature reduces to one thing, computing a good mask. Mark a block -inf and its bytes never move.

To decide which blocks to keep I built a boolean: keep a block if it scores as content-salient or it falls in a recent window. And I built that OR the way you build booleans when you’re working in a tensor graph that has no if: out of arithmetic, with the Heaviside step function. step(x) is 1 for positive x and 0 otherwise, so keep = step(content_score + in_window) is a serviceable OR, and step of zero is supposed to be the “neither” case that evaluates to drop.

Turns out that “supposed to” in the previous sentence was doing a lot of heavy lifting.

The symptom: drop work, get… slower???

With the mask in place, the thing got slower. What the heck? At a 131K-token prefill, baseline throughput was 473 t/s; my sparse version did 372. That’s confusing, because the mechanism is supposed to be strictly subtractive. I’m dropping blocks. FlashAttention should be reading less. Dropping work and getting a slowdown means something is very wrong, and I had a mask that looked correct, output that was coherent, retrieval that passed my tests, and a number going the wrong way.

A week of being wrong

None of these were dumb guesses. Every one had a real mechanism behind it, got measured, and got fixed, and not one of them was the problem (RIP me, I guess, but will be useful later!). The graveyard, in order:

Wrong theory #1: a CPU fallback. My block-scoring step pooled the keys with ggml_pool_1d, and the Vulkan backend has no 1D pooling kernel, only 2D. So every micro-batch, that op silently fell back to the CPU, with a full GPU->CPU->GPU round trip and a sync each time. That’s real, and I confirmed it three ways (source, a decomposition that showed a fixed per-ubatch tax independent of how much I dropped, and a scheduler trace). I rewrote it as a ggml_pool_2d over the token axis. It helped: 0.79x became 0.94x. Still a slowdown.

Wrong theory #2: an O(n^2) re-sweep. The graph is stateless per micro-batch, so each of the ~256 prefill ubatches re-pools all of the keys accumulated so far. Sum that over the prefill and you get an O(n^2) re-sweep of K, the same order as FlashAttention’s own read. I built a model that fit all my data points and it was beautiful. The fix is to amortize: cache each block’s representative once, when its tokens are written to the KV cache, and read the cache instead of re-pooling. A persistent per-layer rep buffer parallel to the KV cache, incremental updates on the write path, the whole thing. It compiled, it retrieved correctly, and it bought zero speed. The beautiful model was wrong.

Wrong theory #3: FlashAttention isn’t skipping at all. By now I’d stopped trusting my assumptions and turned on the Vulkan performance logger. It settled it: FlashAttention was dominating the trace, reading the full K, every layer, the entire time. The sparse machinery I’d been agonizing over wasn’t even in the top 25 ops. The keep-nothing case (an all--inf mask, which should be the maximum possible skip) ran at 390 t/s, slower than baseline. If the skip were firing, that case would be 2x. It wasn’t firing. At all. For any version. Every “sparse” run I’d done was running dense attention plus the overhead of computing a mask nobody used.

That reframed the hunt. The mask was being computed and added, and I could prove it was logically -inf, because a keep-nothing mask makes the model attend over nothing and produce NaNs. But FlashAttention wouldn’t tile-skip it. Meanwhile a simpler fixed-window mask, the one I’d built first and fed in as a plain input tensor, skipped fine. Same kernel, same model, one mask skips and one doesn’t.

Wrong theory #4: the graph scheduler / buffer aliasing. I narrowed it hard. A diagnostic that forced the mask to all--inf through a short op chain skipped cleanly (>2x, the win was provably there). My long dynamic chain produced a mask that matched the short one in shape, stride, type, and contiguity, and as far as I could tell held the same -inf values, and yet FlashAttention skipped one and not the other. The only difference was the graph that produced the tensor. From there it really looked like a scheduler or graph-allocator bug: a missing barrier between my long producer chain and the FlashAttention consumer, or a buffer getting aliased and overwritten before FA’s skip pre-pass read it. I had a whole “the deeply-derived tensor lands in a reused buffer” story, complete with a clean-buffer workaround that happened to make it skip. It was, again, plausible, and again not the cause.

The pattern across all of it: I spent days on re-sweep math, op fusion, mask layout, sync barriers, and buffer lifetimes. Every one of those was a sophisticated guess about systems behavior, and every one was wrong, because the bug wasn’t in the systems. It was in arithmetic.

The actual bug

What all my workarounds had in common, the reason “force the mask through a short chain” or “drop into a freshly-filled buffer” made it skip, is that they all bypassed the math that built the keep/drop decision. I’d been treating that math as incidental, when it was the whole bug.

Recall the OR: keep = step(content_score + in_window), where step is 1 for positive and 0 otherwise. The intent is “keep this block if it’s content-salient or it’s in the window.” The “neither” case, score zero and not in the window, is supposed to land on step(0) and evaluate to drop.

step(0) does not return 0 on the Vulkan backend. It returns 1.

Here are the two implementations, which are supposed to compute the same function:

// CPU reference: ggml/src/ggml-cpu/vec.h, ggml_vec_step_f32
y[i] = (x[i] > 0.f) ? 1.f : 0.f;     // step(0) = 0

// Vulkan: ggml/src/ggml-vulkan/vulkan-shaders/unary.comp, op_step
return x >= 0.0f ? 1.0f : 0.0f;      // step(0) = 1

> on the CPU, >= on the GPU. At exactly zero they disagree by the maximum possible amount. And boolean-indicator math sits on exactly-zero constantly, because that’s what “this condition is false” is. So on the GPU, my “neither” case, the blocks I most wanted to drop, evaluated to step(0) = 1 = keep. Every block was kept. The mask was effectively all-zeros. FlashAttention dutifully read every tile, because I’d told it to, and I’d spent a week blaming the scheduler for honoring my own broken arithmetic.

It survives undetected because you only hit it at an input of exactly 0.0, and the upstream test for this op never does. test-backend-ops initializes its inputs with uniform random floats in [-150, 150], which never land precisely on zero, so the STEP test passes green while the op is wrong. I forced the input to exactly 0 in a one-line edit and reran it:

[STEP] ERR = 1.000000000 > 0.000000100   STEP(type=f16,...,v=0): FAIL
[STEP] ERR = 1.000000000 > 0.000000100   STEP(type=f32,...,v=0): FAIL
0/4 tests passed   Backend Vulkan0: FAIL

ERR of exactly 1.0: CPU says 0, Vulkan says 1. Both precisions. The fix is one character, >= to >, to match the CPU reference. My local workaround, until that lands, is to bias the OR by 0.5 so the boundary is never at zero: step((a + b) - 0.5), which is robust no matter which way step(0) falls. I filed it upstream with the repro and the one-line fix: ggml-org/llama.cpp#25027. A Vulkan-backend maintainer picked it up and wrote the actual patch, PR #25036, >= to >, approved by two maintainers and waiting to merge as I write this.

“step(0) is off by one” sounds like a corner case, and for most llama.cpp users it is one: step barely appears in mainstream model graphs, so nobody’s chat session was silently wrong because of this. The cost is narrower and meaner than that. If you build boolean or gating logic out of step on the Vulkan backend, the way I did, your graph silently disagrees with the identical graph on the CPU, at exactly the input boolean logic lands on constantly: the false case, where the indicators sum to zero. Nothing crashes and nothing warns. It just quietly does the opposite of what you wrote, on one backend only.

That is also what fixing it means. Once the patch lands, step on Vulkan agrees with the CPU, I drop the -0.5 hack, and the mask works without a workaround. The broader win is quieter than a speedup: ggml has one fewer place where the same graph returns two different answers on two backends. A portable op that isn’t actually portable is worse than no op at all, because you trust it. Now it matches the CPU, like it always should have.

The thing that still stings: every fix I built in the graveyard above was correct. The pool_2d rewrite, the rep cache, the clean mask construction, they were all better, and they all paid off the instant the mask started dropping blocks. I just built them to fix problems that were downstream of a one-character lie about whether zero is positive.

Exactly how much each of those buys, now that the skip actually fires, is a writeup of its own. I’ll dig into the pool_1d-to-pool_2d fix, the amortized rep cache, and the rest of the speed work in a follow-up post. This one’s about the bug.

The lesson I’ll actually remember

If a compute graph validates fine on the CPU and misbehaves only on a GPU backend, suspect a per-op semantic mismatch first, before scheduling, fusion, barriers, or memory. Backends are supposed to compute the same function and almost always do, which is exactly why the rare place they don’t is so expensive: every instinct sends you to the systems layer, where the behavior is, instead of the arithmetic, where the bug is. The tell, in hindsight, was that all my workarounds worked by avoiding the math, and I read that as “the systems path is fragile” when it meant “the math is wrong.” A week, for >= versus >.

It generalizes past this one shader. The same class of thing, the same op computing subtly different numbers depending on where it runs, is exactly what the determinism people keep finding (Thinking Machines’ writeup on batch-variance is the recent canonical one), and it’s why the boring discipline pays: when you build boolean logic out of arithmetic, the boundary value is a real input, not an edge case, and it’s the one a random test generator never produces on its own. The whole thing was avoidable, and I’d have caught it on day one if I’d doubted the easy part instead of the hard part.