On a track, I don't guess. I find the corner that's costing me time and trim it against the clock. GPUs are no different: a roofline model tells me the limiting factor before I write any code, then I build the kernel, measure it, and check the prediction against what really happened, including the times I was wrong.
I came up through mechanical engineering and ML research. At UC Santa Cruz I built Sconce, a model-compression tool: ~90% fewer parameters, 2.8× faster inference, accuracy held. That quantization work is what pulled me into low-precision attention. I'd never written an attention kernel before this. Now there are seventeen.
Three eras, each with a different limiting factor. The roofline predicted the limiter before every step; the honest misses are on the record.
I pre-registered "it won't help," then measured it on a B300: FP4 reaches only 5.8% of its 15 PF peak, FP8 21.7% of 5 PF, both HBM-bound. The packing gate is a red herring for decode; low precision is a KV-cache capacity + accuracy lever, not a decode-speed one.
A live benchmark run, correctness first, then the speedup and the counter-free %HBM that proves the limiter.
Every number on this site traces to a measured run in docs/results.md.
Limiter location right most steps; magnitude missed 5 straight, the reason the model became two-layered.
An independent review re-ran the roofline tool and reproduced every published arithmetic-intensity number exactly, and cross-checked the external claims against current literature, the B300 constants, the decode-is-work-starved characterization, and the "complement, not beat" positioning all held.
"The misses are productive failures. Each one ruled out a wrong explanation and redirected me toward the real limiting factor."
A pure FLOPs/bytes roofline is blind to the schedule. It named the limiter location right most steps but missed the wall-clock magnitude five straight, because L2 caching, occupancy, and launch overhead live in the schedule, not the arithmetic. Recording that honestly is the whole contribution.
"Deleting 99% of DRAM traffic made the kernel slower. Nothing was bandwidth-bound."
"The M=128 packing advantage is a red herring for decode, the shape stays memory-bound."
"'Per-CTA-bound' was a statement about the engine, not the problem. The right engine converts work to throughput."
Open to ML-systems / GPU-kernel / research internships. The kernels are the foundation for a from-scratch mini-vLLM.
Applied bottleneck-hunting from mechanical systems to GPU kernels, 17 CUDA kernels, 4 architectures, 8–16× over PyTorch, characterized NVIDIA's flagship B300.