Kien Pham · sconce
Project 04 · ML research · UC Santa Cruz Science Internship Program · Summer 2024

Sconce: end-to-end model compression

Deep networks grow faster than the hardware that runs them. Sconce is an AutoML package that compresses a model end-to-end with minimal human intervention: pruning, quantization, and sparsity in one pipeline, so large models deploy on small budgets.

72–75% memory reduced
84–94% parameters removed
2.8× inference speedup
6 CNNs tested
The pipeline

Four stages, one pass

Each stage removes a different kind of redundancy. Chained, they remove most of a network's weights while keeping the accuracy loss measured and bounded.

01 · Channel-wise pruning
Whole convolutional channels that contribute little are removed. Because the pruning is structural, the layers themselves get smaller, so the speedup shows up on real hardware.
02 · Granular magnitude pruning
Individual near-zero weights are zeroed out, reducing parameter counts by 84–94% across the six test networks.
03 · Quantization
Weights are converted from FP32 to lower-precision formats. The experience from this stage later carried over to my work on low-precision attention kernels.
04 · Sparsity engines + CUDA fine-tune
Sparse execution and GPU-accelerated fine-tuning together recover accuracy and deliver the 2.8× inference speedup.
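
The first three stages can be sketched on toy data. This is a minimal, dependency-free illustration and not Sconce's code or API: the helper names (`prune_channels`, `magnitude_prune`, `quantize_int8`), the L1-norm ranking criterion, the keep and sparsity ratios, and the choice of symmetric INT8 as the low-precision format are all assumptions made for the example.

```python
# Illustrative sketch only: toy weights and hypothetical helper names,
# not the Sconce implementation.

def prune_channels(filters, keep_ratio):
    """Channel-wise pruning: keep the channels with the largest L1 norm.

    `filters` is a list of channels, each a flat list of weights.
    The L1-norm criterion is an assumption for this sketch.
    """
    n_keep = max(1, round(len(filters) * keep_ratio))
    # Rank channel indices by L1 norm, largest first.
    ranked = sorted(
        range(len(filters)),
        key=lambda i: sum(abs(w) for w in filters[i]),
        reverse=True,
    )
    kept = sorted(ranked[:n_keep])  # preserve the original channel order
    return [filters[i] for i in kept]


def magnitude_prune(weights, sparsity):
    """Fine-grained pruning: zero the `sparsity` fraction of smallest |w|."""
    n_zero = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:n_zero])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Toy layer: four channels, two of them close to zero.
filters = [
    [0.9, -0.8, 0.7],
    [0.01, -0.02, 0.0],
    [0.5, 0.4, -0.6],
    [0.03, 0.01, -0.02],
]

pruned = prune_channels(filters, keep_ratio=0.5)   # stage 01
flat = [w for channel in pruned for w in channel]
sparse = magnitude_prune(flat, sparsity=0.5)       # stage 02
q, scale = quantize_int8(sparse)                   # stage 03
restored = dequantize(q, scale)

print(len(pruned), "channels kept")
print("sparse weights:", sparse)
print("int8 weights:  ", q)
```

Channel pruning returns a smaller dense layer, while magnitude pruning only writes zeros into a tensor of the same shape. Those zeros save time or memory only when a sparse format or kernel exploits them, which is the role of the fourth stage.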
The honest number

Compression isn't free, and we say so

Post-compression accuracy dropped 5–10% before recovery; CUDA fine-tuning brought the compressed models back to ~93% of baseline accuracy. That trade (most of the model gone, a bounded accuracy cost, 2.8× faster inference) is the result, stated with its price.
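
For reference, the kinds of figures quoted on this page reduce to simple ratios. The sketch below shows one plausible way to compute them. The function names are hypothetical, and every input is a made-up round number chosen for illustration, not a Sconce measurement.

```python
# Illustrative sketch: hypothetical helpers and made-up inputs.

def reduction_pct(before, after):
    """Percentage removed going from `before` to `after` (parameters, bytes)."""
    return 100.0 * (before - after) / before


def speedup(latency_before, latency_after):
    """How many times faster the compressed model runs."""
    return latency_before / latency_after


def relative_accuracy_pct(baseline_acc, compressed_acc):
    """Compressed accuracy expressed as a percentage of the baseline."""
    return 100.0 * compressed_acc / baseline_acc


# Hypothetical measurements for a single network.
params_removed = reduction_pct(1_000_000, 100_000)  # parameter counts
memory_reduced = reduction_pct(40.0, 10.0)          # model size in MB
faster = speedup(28.0, 10.0)                        # latency in ms
retained = relative_accuracy_pct(0.90, 0.837)       # top-1 accuracy

print(f"parameters removed: {params_removed:.0f}%")
print(f"memory reduced:     {memory_reduced:.0f}%")
print(f"inference speedup:  {faster:.1f}x")
print(f"accuracy retained:  {retained:.0f}% of baseline")
```

Reporting accuracy relative to the baseline, rather than as a raw score, keeps the cost of compression comparable across networks with different starting accuracies.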

Where it led

The quantization fluency from Sconce became the foundation for the FP8 and NVFP4 attention kernels a year later. FlashAttention, from scratch →

Research abstract (PDF) ↗ · Pruning & quantization results (PDF) ↗
With A. Chestovaliev · R. Albright · S. Narayanan · Dr. G. M. Muktadir