The crossover

From the racing line to the roofline

On a track, I don't guess. I find the corner that's costing me time and trim it against the clock. GPUs are no different: a roofline model tells me the limiting factor before I write any code, then I build the kernel, measure it, and check the prediction against what really happened, including the times I was wrong.

I came up through mechanical engineering and ML research. At UC Santa Cruz I built Sconce, a model-compression tool: ~90% fewer parameters, 2.8× faster inference, accuracy held. That quantization work is what pulled me into low-precision attention. I'd never written an attention kernel before this. Now there are seventeen.

predict before you cut

The journey · v1 → v12

Seventeen kernels, one measured speedup at a time

Three eras, each with a different limiting factor. The roofline predicted the limiter before every step; the honest misses are on the record.

the dead-ends beat · Era 2

Coda

Arm 2 Stage A, native FP4 compute doesn't help decode

I pre-registered "it won't help," then measured it on a B300: FP4 reaches only 5.8% of its 15 PF peak, FP8 21.7% of 5 PF, both HBM-bound. The packing gate is a red herring for decode; low precision is a KV-cache capacity + accuracy lever, not a decode-speed one.

The receipts

A live benchmark run, correctness first, then the speedup and the counter-free %HBM that proves the limiter.

Every number on this site traces to a measured run in docs/results.md.

Results

What the curve says

memory & capacity wins

125×

S-matrix elimination

2164 MB → 17 MB (v3)

202×

MLA KV capacity vs MHA

measured (v11)

the roofline scorecard

Limiter location right most steps; magnitude missed 5 straight, the reason the model became two-layered.

Independent validation

It reproduces

An independent review re-ran the roofline tool and reproduced every published arithmetic-intensity number exactly, and cross-checked the external claims against current literature, the B300 constants, the decode-is-work-starved characterization, and the "complement, not beat" positioning all held.

"The misses are productive failures. Each one ruled out a wrong explanation and redirected me toward the real limiting factor."

Methodology · honest misses

The misses are the method

A pure FLOPs/bytes roofline is blind to the schedule. It named the limiter location right most steps but missed the wall-clock magnitude five straight, because L2 caching, occupancy, and launch overhead live in the schedule, not the arithmetic. Recording that honestly is the whole contribution.

"Deleting 99% of DRAM traffic made the kernel slower. Nothing was bandwidth-bound."

the S-elimination paradox (v3)

"The M=128 packing advantage is a red herring for decode, the shape stays memory-bound."

Arm 2 Stage A, measured KILL

"'Per-CTA-bound' was a statement about the engine, not the problem. The right engine converts work to throughput."

the v12 correction

The work, in figures

Contact

Let's build fast things

Open to ML-systems / GPU-kernel / research internships. The kernels are the foundation for a from-scratch mini-vLLM.

Email me

GitHub

github.com/gkienpham-cmd/flashattention-cuda

linkedin.com/in/gkienpham

gkienpham@gmail.com

primary bullet

Applied bottleneck-hunting from mechanical systems to GPU kernels, 17 CUDA kernels, 4 architectures, 8–16× over PyTorch, characterized NVIDIA's flagship B300.