Kien Pham · roofline
The crossover

From the racing line to the roofline

On a track, I don't guess. I find the corner that's costing me time and trim it against the clock. GPUs are no different: a roofline model tells me the limiting factor before I write any code, then I build the kernel, measure it, and check the prediction against what really happened, including the times I was wrong.

I came up through mechanical engineering and ML research. At UC Santa Cruz I built Sconce, a model-compression tool: ~90% fewer parameters, 2.8× faster inference, accuracy held. That quantization work is what pulled me into low-precision attention. I'd never written an attention kernel before this. Now there are seventeen.

predict before you cut
The journey · v1 → v12

Seventeen kernels, one measured speedup at a time

Three eras, each with a different limiting factor. The roofline predicted the limiter before every step; the honest misses are on the record.

the dead-ends beat · Era 2
Coda

Arm 2 Stage A, native FP4 compute doesn't help decode

I pre-registered "it won't help," then measured it on a B300: FP4 reaches only 5.8% of its 15 PF peak, FP8 21.7% of 5 PF, both HBM-bound. The packing gate is a red herring for decode; low precision is a KV-cache capacity + accuracy lever, not a decode-speed one.

The receipts

The receipts

A live benchmark run, correctness first, then the speedup and the counter-free %HBM that proves the limiter.

Every number on this site traces to a measured run in docs/results.md.

Results

What the curve says

memory & capacity wins
125×
S-matrix elimination
2164 MB → 17 MB (v3)
202×
MLA KV capacity vs MHA
measured (v11)
the roofline scorecard

Limiter location right most steps; magnitude missed 5 straight, the reason the model became two-layered.

Independent validation

It reproduces

An independent review re-ran the roofline tool and reproduced every published arithmetic-intensity number exactly, and cross-checked the external claims against current literature, the B300 constants, the decode-is-work-starved characterization, and the "complement, not beat" positioning all held.

"The misses are productive failures. Each one ruled out a wrong explanation and redirected me toward the real limiting factor."
Methodology · honest misses

The misses are the method

A pure FLOPs/bytes roofline is blind to the schedule. It named the limiter location right most steps but missed the wall-clock magnitude five straight, because L2 caching, occupancy, and launch overhead live in the schedule, not the arithmetic. Recording that honestly is the whole contribution.

"Deleting 99% of DRAM traffic made the kernel slower. Nothing was bandwidth-bound."
the S-elimination paradox (v3)
"The M=128 packing advantage is a red herring for decode, the shape stays memory-bound."
Arm 2 Stage A, measured KILL
"'Per-CTA-bound' was a statement about the engine, not the problem. The right engine converts work to throughput."
the v12 correction
The work, in figures

The work, in figures

Contact

Let's build fast things

Open to ML-systems / GPU-kernel / research internships. The kernels are the foundation for a from-scratch mini-vLLM.

Email me GitHub LinkedIn
GitHub
github.com/gkienpham-cmd/flashattention-cuda
LinkedIn
linkedin.com/in/gkienpham
Email
gkienpham@gmail.com
primary bullet

Applied bottleneck-hunting from mechanical systems to GPU kernels, 17 CUDA kernels, 4 architectures, 8–16× over PyTorch, characterized NVIDIA's flagship B300.