SQuaT: Self-Supervised Knowledge Distillation via Student-Aware Quantized Teacher Features
Abstract
Quantization-Aware Training (QAT) enables deployment of full-precision networks on resource-constrained hardware, yet most methods assume access to large labeled datasets. In practice, labels are scarce or expensive, which motivates, but also complicates, label-free knowledge distillation (KD) for QAT. Prior work such as SQAKD aligns teacher and student only at the logit level, yielding limited representational transfer, while direct feature distillation is brittle due to the mismatch between full-precision teacher features and quantized student representations. We propose SQuaT (Student-Aware Quantized Teacher Features). The key idea is to apply the student's quantization parameters to the teacher's feature maps, producing student-quantized teacher features, and to perform feature-level distillation on the student's quantization lattice; we additionally distill at the logit level, transferring knowledge at both the representation and output levels. Beyond accuracy, we demonstrate on-device acceleration on real hardware boards, confirming practical deployment gains, and we train label-free at ImageNet scale, showing that SQuaT handles large datasets without annotations. Across diverse datasets, bit-widths, and hardware settings, SQuaT consistently outperforms SQAKD, with especially strong gains in ultra-low-bit (1-bit) settings, establishing it as a robust and effective solution for label-free QAT KD. Our code is available at https://anonymous.4open.science/r/MLPR-45D6/