Project 04 · ML research · UC Santa Cruz Science Internship Program · Summer 2024

Sconce: end-to-end model compression

Deep networks grow faster than the hardware that runs them. Sconce is an AutoML package that compresses a model end-to-end with minimal human intervention: pruning, quantization, and sparsity in one pipeline, so large models deploy on small budgets.

72–75%

memory reduced

84–94%

parameters removed

2.8×

inference speedup

6

CNNs tested

The pipeline

Four stages, one pass

Each stage removes a different kind of redundancy. Chained, they eliminate most of a network's weights while the accuracy loss stays measured and bounded.

Channel-wise pruning

Whole convolutional channels that contribute little are removed. The pruning is structural: the layers that remain are smaller but still dense, so the speedup shows up on ordinary hardware.
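A minimal sketch of the idea, not Sconce's actual code. It assumes channels are ranked by the L1 norm of their weights, which is one common criterion; the page does not say which criterion Sconce uses. The names `channel_l1_norms`, `prune_channels`, and `keep_ratio` are hypothetical.

```python
def channel_l1_norms(filters):
    """L1 norm of each output channel (one flat list of weights per channel)."""
    return [sum(abs(w) for w in channel) for channel in filters]


def prune_channels(filters, keep_ratio):
    """Keep the highest-norm channels, preserving their original order."""
    norms = channel_l1_norms(filters)
    n_keep = max(1, round(len(filters) * keep_ratio))
    # Rank channel indices from largest to smallest norm.
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [filters[i] for i in kept], kept


# Toy layer: four output channels, four weights each (assumed values).
filters = [
    [0.9, -0.8, 0.7, 0.6],
    [0.01, 0.02, -0.01, 0.0],
    [0.5, 0.4, -0.6, 0.3],
    [0.02, -0.03, 0.01, 0.01],
]
pruned, kept = prune_channels(filters, keep_ratio=0.5)
print(kept)  # [0, 2]
```

In a real network the next layer's matching input channels have to be dropped too, otherwise the tensor shapes no longer line up.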

Granular magnitude pruning

Individual near-zero weights are zeroed out, pushing parameter counts down 84–94% across the six test networks.
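As a rough illustration, magnitude pruning on a flat list of weights can be sketched as below. The function name `magnitude_prune`, the toy weights, and the 90% target (picked from inside the 84–94% range above) are assumptions for the example, not Sconce's implementation.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of a flat weight list."""
    n_prune = round(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_prune])
    mask = [0 if i in dropped else 1 for i in range(len(weights))]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    return pruned, mask


weights = [0.8, -0.02, 0.05, -0.6, 0.01, 0.3, -0.04, 0.9, 0.03, -0.07]
pruned, mask = magnitude_prune(weights, sparsity=0.9)
print(pruned.count(0.0))  # 9 of 10 weights are now zero
print(mask)               # only index 7 (the 0.9 weight) survives
```

The mask is returned alongside the weights because later fine-tuning needs it to keep the pruned positions at zero.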

Quantization

Weights step down from FP32 to low-precision formats. What I learned in this stage later carried over to my work on low-precision attention kernels.
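The page does not name the target format, so the sketch below assumes symmetric per-tensor INT8, one of the simplest schemes. `quantize_int8` and `dequantize` are hypothetical helper names.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [50, -127, 0, 100]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error, at most scale / 2
```

Each weight now needs one byte instead of four, and the rounding error per weight is bounded by half a quantization step.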

Sparsity engines + CUDA fine-tune

Sparse execution plus GPU-accelerated fine-tuning recovers accuracy and lands the 2.8× inference speedup.
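One way fine-tuning can recover accuracy without undoing the pruning is to re-apply the pruning mask on every update. The sketch below shows a single masked SGD step in plain Python; the real stage runs on the GPU, and `masked_sgd_step`, the gradients, and the learning rate here are all assumed for illustration.

```python
def masked_sgd_step(weights, grads, mask, lr):
    """One SGD update that keeps pruned positions (mask == 0) at exactly zero."""
    return [
        (w - lr * g) if keep else 0.0
        for w, g, keep in zip(weights, grads, mask)
    ]


weights = [0.8, 0.0, 0.0, -0.6]   # two weights already pruned
grads = [0.1, 0.5, -0.2, -0.1]    # assumed gradients from one batch
mask = [1, 0, 0, 1]

updated = masked_sgd_step(weights, grads, mask, lr=0.1)
print(updated)  # surviving weights move, pruned ones stay at 0.0
```

Because the zeros never come back, the sparsity gained in the earlier stages survives the fine-tune.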

The honest number

Compression isn't free, and we say so

Post-compression accuracy dropped 5–10% before recovery; CUDA fine-tuning brought the compressed models back to ~93% of baseline accuracy. That trade (most of the model gone, a bounded accuracy cost, 2.8× faster inference) is the result, stated with its price.
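To read the headline percentages as compression ratios, the arithmetic is 1 / (1 - fraction removed). The helper below is a small illustration with a hypothetical name; only the percentages come from this page.

```python
def reduction_to_ratio(percent_removed):
    """Convert 'percent removed' into an 'x times smaller' factor."""
    return 1.0 / (1.0 - percent_removed / 100.0)


for label, low, high in [("parameters", 84, 94), ("memory", 72, 75)]:
    print(f"{label}: {reduction_to_ratio(low):.2f}x to "
          f"{reduction_to_ratio(high):.2f}x smaller")
```

So 84–94% of parameters removed means roughly 6× to 17× fewer parameters, and 72–75% less memory means a footprint about 3.6× to 4× smaller.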

Where it led

The quantization fluency from Sconce became the foundation for the FP8 and NVFP4 attention kernels a year later. FlashAttention, from scratch →

Research abstract (PDF) ↗

Pruning & quantization results (PDF) ↗

With A. Chestovaliev · R. Albright · S. Narayanan · Dr. G. M. Muktadir
