MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode

The MoonMath.ai team has released a bf16 forward attention kernel for AMD’s MI300X GPU. It is written in HIP rather than hand-written assembly, and the code is open source under the MIT license. The team reports that it beats AITER v3, AMD’s own optimized kernel, on every tested shape. Bare-metal access for development came from HotAisle, an AMD cloud provider.

Attention is the fused softmax(QKᵀ/√d)·V operation inside every transformer. The MI300X is AMD’s CDNA3 data-center GPU; its ISA target is gfx942. This kernel runs on that hardware only.
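As a reference point, the formula can be written out for a single head in plain Python. This is a minimal, unoptimized sketch for checking the math; the function names are illustrative and are not taken from the kernel’s code.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V for one head; q is [Sq][d], k and v are [Skv][d]."""
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q_row in q:
        # One score per key row, scaled by 1/sqrt(d)
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        probs = softmax(scores)
        # Output row is the probability-weighted sum of the value rows
        out.append([sum(p * v_row[j] for p, v_row in zip(probs, v))
                    for j in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(q, k, v)
```

Note that the query and key/value lengths differ here (2 versus 3), which is the cross-attention case.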

TL;DR

MoonMath.ai open-sources a bf16 forward attention kernel for AMD MI300X, written in HIP, not assembly (MIT).
It beats AMD’s AITER v3 on every shape and rounding mode — geomean 1.18×/1.15×/1.08×, up to 1.26×.
The core trick: one-instruction asm wrappers let you pick the opcode while the compiler allocates registers.
Most of the speedup is memory placement — K in LDS, V hot in L1, Q and accumulators in registers.
A real SGLang PR used it to speed up Wan2.1 video diffusion by 1.23×, with no quality regression.

Understanding the Kernel

A kernel is a small program that runs directly on the GPU’s many cores to perform one specific computation—here, the attention math—as fast as the hardware allows. The kernel computes forward attention in bf16 on MI300X only. It takes inputs in either BSHD or BHSD layout, with no transpose. Head dimension is fixed at 128. It supports any sequence length, including cross-attention.
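To make the two layouts concrete, the sketch below computes element offsets for contiguous BSHD and BHSD tensors. It assumes densely packed storage; the helper names are hypothetical and only illustrate why a kernel can accept either layout by changing strides instead of transposing data.

```python
def strides(layout, B, S, H, D):
    """Element strides in (b, s, h, d) order for a contiguous 'bshd' or 'bhsd' tensor."""
    if layout == "bshd":
        # Memory order: batch, sequence, head, dim
        return (S * H * D, H * D, D, 1)
    if layout == "bhsd":
        # Memory order: batch, head, sequence, dim
        return (H * S * D, D, S * D, 1)
    raise ValueError(layout)

def offset(layout, B, S, H, D, b, s, h, d):
    """Flat element offset of logical index (b, s, h, d)."""
    sb, ss, sh, sd = strides(layout, B, S, H, D)
    return b * sb + s * ss + h * sh + d * sd

B, S, H, D = 2, 8192, 24, 128
# Distance in elements between consecutive tokens of the same head
token_step_bshd = offset("bshd", B, S, H, D, 0, 1, 0, 0)
token_step_bhsd = offset("bhsd", B, S, H, D, 0, 1, 0, 0)
print(token_step_bshd, token_step_bhsd)
```

In BSHD the rows of one head are H·D elements apart, in BHSD they are D elements apart; the logical indexing is identical.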

There are real limits. There is no causal mask, no GQA, and no varlen batching. Outputs are bf16, and it runs on gfx942 hardware exclusively.

Numerics are tightly controlled. All three rounding modes match AITER’s per-mode rounding rule. Every finite output sits within 1 bf16 ULP of AITER. NaN and Inf handling is bit-identical, and results are deterministic.
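The post does not describe how the 1-ULP comparison was measured. One common approach, sketched here as an assumption, maps each bf16 bit pattern to an integer that increases with the value and takes the absolute difference. The helper names are illustrative.

```python
import struct

def bf16_bits_from_float(x):
    """Top 16 bits of the fp32 encoding (exact when x is already representable in bf16)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def ordered(bits):
    """Map a finite bf16 sign-magnitude pattern to an integer that increases with the value."""
    magnitude = bits & 0x7FFF
    return -magnitude if bits & 0x8000 else magnitude

def ulp_distance(a_bits, b_bits):
    """Number of representable bf16 steps between two finite values (+0 and -0 coincide)."""
    return abs(ordered(a_bits) - ordered(b_bits))

one = bf16_bits_from_float(1.0)           # 0x3F80
next_up = bf16_bits_from_float(1.0078125) # 1 + 2**-7, the next bf16 value above 1.0
print(hex(one), hex(next_up), ulp_distance(one, next_up))
```

A check such as `ulp_distance(ours, reference) <= 1` over every finite output element would express the claim above.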

The Core Trick: One-Instruction asm Wrappers

The core technique avoids a familiar dilemma. Compiler intrinsics keep code tidy but let the compiler reorder or rename operands. Raw inline assembly gives control but forces manual register and address management.

MoonMath wraps exactly one instruction in a __device__ __forceinline__ function. Extended asm constraints describe the operands. The programmer picks the opcode; the compiler still allocates registers and tracks data flow.

// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.
__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {
    asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"
                 : "+v"(c) : "v"(a), "v"(b));
}

The "+v"(c) constraint ties the accumulator input and output to the same VGPR. No copy instruction is emitted. This keeps the kernel close to ordinary HIP. It still steers the machine one instruction at a time.

The Architecture: Eight Waves, Two Groups, Two Barriers

A CDNA3 compute unit has four SIMD units. The textbook block is four waves. MoonMath instead runs eight waves per block, in two groups of four.

The two groups run the same Q*K, softmax, O += P*V sequence. They are offset by a phase. While one group saturates the matrix core, the other runs softmax and issues loads. Then they swap, so the matrix core never idles.
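The Q*K, softmax, O += P*V sequence runs once per K/V tile, which implies a running softmax that is rescaled as new tiles arrive. The sketch below shows the standard FlashAttention-style online softmax for one query row in plain Python. It is an assumption that the kernel uses this exact recurrence; the code only illustrates the loop structure, not the MoonMath implementation.

```python
import math

def attention_row_direct(q_row, k, v):
    """Reference: full softmax over all keys at once."""
    scale = 1.0 / math.sqrt(len(q_row))
    s = [scale * sum(a * b for a, b in zip(q_row, kr)) for kr in k]
    m = max(s)
    p = [math.exp(x - m) for x in s]
    total = sum(p)
    return [sum(pi * vr[j] for pi, vr in zip(p, v)) / total for j in range(len(v[0]))]

def attention_row_tiled(q_row, k, v, tile=2):
    """Same result, visiting K/V tile by tile with a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(q_row))
    m = float("-inf")        # running max of the scores seen so far
    l = 0.0                  # running sum of exp(score - m)
    o = [0.0] * len(v[0])    # running unnormalized output
    for start in range(0, len(k), tile):
        k_tile, v_tile = k[start:start + tile], v[start:start + tile]
        s = [scale * sum(a * b for a, b in zip(q_row, kr)) for kr in k_tile]  # Q*K
        m_new = max(m, max(s))
        alpha = math.exp(m - m_new)            # rescales contributions of earlier tiles
        p = [math.exp(x - m_new) for x in s]   # softmax numerators for this tile
        l = alpha * l + sum(p)
        o = [alpha * oj + sum(pi * vr[j] for pi, vr in zip(p, v_tile))
             for j, oj in enumerate(o)]        # O += P*V
        m = m_new
    return [oj / l for oj in o]

q_row = [0.5, -1.0, 2.0]
k = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, -1.0, 1.0], [0.2, 0.4, 0.6]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(attention_row_direct(q_row, k, v), attention_row_tiled(q_row, k, v))
```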

There are two s_barriers per iteration. One sits at the phase handoff. One sits at the iteration boundary. Per-counter waits handle the rest of the synchronization.
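The ping-pong schedule can be pictured with a toy simulation. The group labels A and B follow the original post’s interactive explainer; the exact placement of each barrier within the loop is an assumption drawn from the description above, not from the kernel source.

```python
def schedule(iterations):
    """Yield (iteration, phase, matrix_core_group, memory_group) for the two-phase ping-pong."""
    for it in range(iterations):
        for phase in (1, 2):
            on_matrix_core = "A" if phase == 1 else "B"
            on_memory = "B" if phase == 1 else "A"
            yield it, phase, on_matrix_core, on_memory

for it, phase, mc, mem in schedule(2):
    # Assumed placement: one barrier after phase 1, one after phase 2
    barrier = "phase handoff" if phase == 1 else "iteration boundary"
    print(f"iter {it} phase {phase}: group {mc} on matrix core, "
          f"group {mem} on softmax and loads, then s_barrier ({barrier})")
```

In every phase exactly one group occupies the matrix core, which is the property that keeps it from idling.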

This echoes FlashAttention-3’s alternation of matmul and softmax, but it does not copy FA3’s producer/consumer warp split. On CDNA3, every memory move is already asynchronous, so a dedicated producer wave is unnecessary.

Where Data Lives, and Why 16×16×16

Most of the speedup comes from memory placement. K streams from HBM into LDS, double-buffered, shared by all eight waves. V stays hot in L1, read on every PV matmul. Q and accumulators live in registers.

The team picked the 16×16×16 MFMA over 32×32×8. Both shapes have identical throughput, but the smaller tile accumulates into 4 fp32 elements per lane, versus 16 for the larger one. Lower accumulator pressure leaves room for deeper prefetch and a third Q tile.
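The per-lane accumulator counts follow from spreading the M×N output tile over the 64 lanes of a CDNA3 wavefront. The short calculation below reproduces the 4 and 16 quoted above; the function names are illustrative.

```python
WAVE_SIZE = 64  # lanes per wavefront on CDNA3

def acc_per_lane(m, n):
    """fp32 accumulator elements each lane holds for one M x N output tile."""
    return (m * n) // WAVE_SIZE

def macs_per_instruction(m, n, k):
    """Multiply-accumulates performed by one M x N x K MFMA instruction."""
    return m * n * k

small = (16, 16, 16)
large = (32, 32, 8)
print(acc_per_lane(*small[:2]), acc_per_lane(*large[:2]))          # accumulators per lane
print(macs_per_instruction(*small), macs_per_instruction(*large))  # work per instruction
```

The larger shape does twice the work per instruction, so equal throughput means it occupies the matrix core for proportionally longer; the difference that matters here is the fourfold accumulator footprint.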

Decision	Choice	Reason
Waves per block	8 (two groups of 4)	Plan the pipeline directly; share one K copy
MFMA shape	16×16×16 bf16	Same throughput, lower VGPR pressure, better power efficiency
K placement	LDS, double-buffered, 32 KiB	Shared by all 8 waves, swapped per iteration
V placement	L1, resident, prefetched	Reread across PV, kept hot deliberately
Q + accumulators	VGPRs	Read every iteration, never reloaded

Two later optimizations close the remaining gap. A third Q tile (3Q) raises data reuse per loaded K and V tile. A Flash-Decoding-style KV split on the tail handles the final, partially filled round of blocks across MI300X’s 304 CUs, where most CUs would otherwise sit idle. These wins cascade: moving V to L1 freed the LDS that the third Q tile then fills.
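The LDS arithmetic behind that cascade can be checked in a few lines. The 32 KiB figure for K comes from the table above; the 32 KiB third Q tile and the 48 q-rows per wave come from the original post’s memory-map explainer. Treating the 32 KiB for K as the total across both buffers, and each Q tile as 16 rows per wave, are assumptions of this sketch.

```python
KIB = 1024
BF16_BYTES = 2
HEAD_DIM = 128
LDS_PER_CU = 64 * KIB   # LDS capacity of one CDNA3 compute unit
K_LDS = 32 * KIB        # K, double-buffered (assumed total for both buffers)
WAVES = 8
Q_ROWS_PER_WAVE = 48    # from the explainer; three Q tiles per wave
Q_ROWS_PER_TILE = Q_ROWS_PER_WAVE // 3

# Rows of K that fit in one of the two buffers
k_rows_per_buffer = (K_LDS // 2) // (HEAD_DIM * BF16_BYTES)

# Bytes needed to park one Q tile for every wave in LDS
q3_bytes = WAVES * Q_ROWS_PER_TILE * HEAD_DIM * BF16_BYTES

print(k_rows_per_buffer, q3_bytes // KIB, (K_LDS + q3_bytes) // KIB)
```

Under these assumptions K and the third Q tile together fill the 64 KiB exactly, which is consistent with V having to live somewhere other than LDS.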

Benchmark

Tests ran on MI300X in bf16, head dimension 128. Each shape was measured at three rounding modes. RTNE rounds to nearest even. RTNA rounds to nearest, ties away from zero. RTZ truncates toward zero.
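The three modes differ only in how the 16 discarded low bits of an fp32 value are handled. The sketch below rounds a finite fp32 value to a bf16 bit pattern under each mode. It is a generic illustration of the rounding rules, not AITER’s or MoonMath’s code, and it ignores NaN handling.

```python
import struct

def f32_bits(x):
    """32-bit IEEE-754 encoding of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def to_bf16(x, mode):
    """Round a finite fp32 value to a bf16 bit pattern under 'rtne', 'rtna', or 'rtz'."""
    bits = f32_bits(x)
    upper, lower = bits >> 16, bits & 0xFFFF
    if mode == "rtz":
        return upper                              # drop the low bits
    if mode == "rtna":
        return upper + (1 if lower >= 0x8000 else 0)  # ties go up in magnitude
    if mode == "rtne":
        round_up = lower > 0x8000 or (lower == 0x8000 and (upper & 1) == 1)
        return upper + (1 if round_up else 0)     # ties go to the even mantissa
    raise ValueError(mode)

tie = 1.00390625  # 1 + 2**-8, exactly halfway between bf16 1.0 and 1.0078125
for mode in ("rtne", "rtna", "rtz"):
    print(mode, hex(to_bf16(tie, mode)))
```

Because bf16 is sign-magnitude, incrementing the upper 16 bits always increases the magnitude, so the same code handles negative inputs.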

Shape (B, H, S, D)	Rounding	Ours (ms)	AITER v3 (ms)	Speedup vs AITER	Speedup vs MAX
(2, 24, 8192, 128)	RTNE	3.083	3.792	1.23×	1.37×
(2, 24, 16384, 128)	RTNE	11.670	14.691	1.26×	1.54×
(4, 16, 16384, 128)	RTZ	15.055	16.183	1.07×	1.47×
(2, 24, 32768, 128)	RTNA	44.440	52.363	1.18×	1.57×
(1, 16, 131072, 128)	RTNE	232.517	269.278	1.16×	1.46×
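The speedup column can be recomputed from the timings, and the timings can be turned into an achieved-throughput estimate. The FLOP count below uses the common 4·B·H·S²·D convention for unmasked self-attention; that convention is an assumption here, since the post reports milliseconds only.

```python
# (shape, rounding mode, ours ms, AITER v3 ms, speedup listed in the table)
ROWS = [
    ((2, 24, 8192, 128), "RTNE", 3.083, 3.792, 1.23),
    ((2, 24, 16384, 128), "RTNE", 11.670, 14.691, 1.26),
    ((4, 16, 16384, 128), "RTZ", 15.055, 16.183, 1.07),
    ((2, 24, 32768, 128), "RTNA", 44.440, 52.363, 1.18),
    ((1, 16, 131072, 128), "RTNE", 232.517, 269.278, 1.16),
]

def attention_flops(B, H, S, D):
    """2*S*S*D for Q*K^T plus 2*S*S*D for P*V, per head, with no mask."""
    return 4 * B * H * S * S * D

def achieved_tflops(shape, ms):
    """Throughput in TFLOP/s for one forward pass taking `ms` milliseconds."""
    return attention_flops(*shape) / (ms * 1e-3) / 1e12

for shape, mode, ours_ms, aiter_ms, listed in ROWS:
    speedup = aiter_ms / ours_ms
    print(f"{shape} {mode}: {speedup:.2f}x (table: {listed}), "
          f"{achieved_tflops(shape, ours_ms):.0f} TFLOP/s")
```

Under this convention the first row works out to roughly 535 TFLOP/s.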

Geomeans across the sweep favor MoonMath. Versus AITER, it scores 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ). Versus Modular MAX, geomeans run 1.44× to 1.49×, and per-shape speedups reach 1.59×.

RTZ is AITER’s own fastest mode and the tightest race. The (4, 16, 16384) RTZ shape moved from 0.95× to 1.07×. The tail KV split is what closed that final gap.
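A rough count shows why this shape is sensitive to the tail. The sketch assumes one block per CU at a time and 384 query rows per block (8 waves × 48 q-rows, the per-wave figure from the post’s explainer); neither number is confirmed as the kernel’s actual grid, so treat the result as an illustration of the quantization effect only.

```python
import math

NUM_CUS = 304
Q_ROWS_PER_BLOCK = 8 * 48   # assumption: 8 waves x 48 q-rows per wave

def rounds(B, H, S):
    """Return (total blocks, full rounds over all CUs, blocks left in the final round)."""
    blocks = B * H * math.ceil(S / Q_ROWS_PER_BLOCK)
    full, tail = divmod(blocks, NUM_CUS)
    return blocks, full, tail

def quantization_efficiency(B, H, S):
    """Fraction of CU slots doing useful work if the tail round is left unsplit."""
    blocks, full, tail = rounds(B, H, S)
    total_rounds = full + (1 if tail else 0)
    return blocks / (total_rounds * NUM_CUS)

blocks, full, tail = rounds(4, 16, 16384)
print(blocks, full, tail, round(quantization_efficiency(4, 16, 16384), 3))
```

Under these assumptions the final round has work for only a small fraction of the 304 CUs, which is the situation a KV split over the tail is meant to fix by giving the idle CUs a share of each remaining block’s K/V range.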

Interactive Explainer

Pipeline, per iteration:
Phase 1: Group A is on the matrix core (PV, QK) while Group B handles memory work (K→LDS, softmax). K streams HBM→LDS.
Phase 2: Group B is on the matrix core (PV, QK) while Group A handles memory work (softmax, V→L1). V is prefetched into L1.

Memory map:
Q tile (VGPRs, persistent): The Q tile is read every iteration and never reloaded, so it stays resident in the vector register file. Two of three Q tiles per wave stay register-resident and hot.
Scores and O (fp32 accumulators in VGPRs): Matrix-core outputs (the score matrix and the running output) never leave registers until the final store. The 16×16×16 MFMA accumulates into just 4 fp32 elements per lane, keeping accumulator pressure low.
K tile (LDS, double-buffered, 32 KiB): One copy of K is shared by all eight waves and swapped per iteration via a double buffer. K streams from HBM straight into LDS by direct DMA, never passing through a VGPR. An XOR swizzle breaks bank conflicts without any padding.
3rd Q tile (LDS, 32 KiB, streamed): Moving V to L1 freed 32 KiB of LDS. The kernel spends it on a third Q tile (48 q-rows per wave). It is parked in LDS and streamed through a ping-pong buffer during the QK matmul, raising K/V reuse.
V_t tile (L1, resident): The pre-transposed V tile is kept hot in L1 and reread on every PV matmul. L1 is not addressable, so residency is engineered by prefetching the next iteration’s lines into a throwaway register; the data lands in L1 as a side effect.
K / V source (HBM, staged via L2): A head-first chiplet swizzle maps all of a (batch, head)’s Q blocks onto a single XCD, so its K and V stay resident in that XCD’s slice of L2 instead of thrashing across all eight.

Use Cases

The kernel installs with pip and exposes a small API. It launches on the caller’s stream, so it overlaps inside larger pipelines.

import torch
import moonmath_attention as ma

# PyTorch's ROCm build uses the "cuda" device string on AMD GPUs
q = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda")

out     = ma.forward(q, k, v, layout="bshd")
out_rtz = ma.forward(q, k, v, layout="bshd", round_mode="rtz")

One concrete use case is video diffusion. The team added LiteAttention support and sent a PR to SGLang diffusion. On Wan2.1-T2V-1.3B-Diffusers, they switched attention from AITER to liteattention_rocm. End-to-end generation improved by 1.23× on MI300X, with no visible quality regression.

The BSHD layout suits diffusion tensors directly. Cross-attention works with any KV length and no padding.

Key Takeaways

The kernel is bf16 forward attention for MI300X, written in HIP under MIT.
It beats AITER v3 on every shape and rounding mode, geomean 1.18×/1.15×/1.08×.
One-instruction asm wrappers give opcode control while the compiler allocates registers.
Memory placement drove most of the gain: K in LDS, V hot in L1, Q in registers.
A real SGLang PR sped up Wan2.1 video diffusion by 1.23× with no quality regression.
