Deep networks grow faster than the hardware that runs them. Sconce is an AutoML package that compresses a model end-to-end with minimal human intervention: pruning, quantization, and sparsity in one pipeline, so large models deploy on small budgets.
Each stage removes a different kind of redundancy. Chained, they eliminate most of a network's weight while the accuracy loss stays measured and bounded.
Post-compression accuracy dropped 5–10% before recovery; CUDA fine-tuning brought the compressed models back to ~93% of baseline accuracy. That trade (most of the model gone, a bounded accuracy cost, 2.8× faster inference) is the result, stated with its price.
The quantization fluency from Sconce became the foundation for the FP8 and NVFP4 attention kernels a year later. FlashAttention, from scratch →