Quantization has emerged as one of the most promising approaches for deploying advanced deep models on resource-constrained hardware. Mixed-precision quantization uses multiple bit-widths within a network to unlock the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization suffers from an enormous search space that incurs immense computational overhead. The quantization process therefore runs on separate high-performance devices rather than on the target hardware itself, which also leads to a significant gap between the hardware metrics considered during search and those observed in real deployment. In this paper, we propose an On-chip Hardware-aware Quantization (OHQ) framework that performs hardware-aware mixed-precision quantization without accessing online devices. First, we construct the On-chip Quantization Awareness (OQA) pipeline, which makes it possible to perceive the actual efficiency metrics of quantized operators on the hardware. Second, we propose the Mask-guided Quantization Estimation (MQE) technique to efficiently estimate the accuracy metrics of operators under the constraints of on-chip computing power. By synthesizing network and hardware insights through linear programming, we obtain optimized bit-width configurations. Notably, the quantization process runs entirely on-chip, without any additional computing devices or data access. We demonstrate accelerated inference after quantization for various architectures and compression ratios, achieving 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively. In deployment, OHQ reduces latency by 15% to 30% compared to INT8.

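This paper proposes an On-chip Hardware-aware Quantization (OHQ) framework for mixed-precision quantization. It builds an On-chip Quantization Awareness (OQA) pipeline and a Mask-guided Quantization Estimation (MQE) technique to make mixed-precision quantization hardware-aware. Network and hardware insights are combined through linear programming to obtain optimized bit-width configurations. Without any additional computing devices or data access, OHQ performs quantized inference across various architectures and compression ratios, achieving 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively, and reducing latency by 15% to 30% compared to INT8 in deployment.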
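
The text above does not give the exact optimization formulation, so the following is only a minimal sketch of the kind of constrained bit-width selection it describes: pick one bit-width per layer so that total estimated accuracy loss is minimized while total latency stays within a budget. The function name `select_bit_widths`, the per-layer sensitivity scores (standing in for MQE estimates), the per-layer latencies (standing in for OQA measurements), and the budget are all hypothetical values chosen for illustration. For simplicity the sketch enumerates every configuration instead of calling a linear programming solver, which is only practical for a handful of layers.

```python
from itertools import product


def select_bit_widths(sensitivity, latency, budget):
    """Choose one bit-width per layer under a latency budget.

    sensitivity[i][b]: assumed accuracy-loss score of layer i at bit-width b.
    latency[i][b]:     assumed on-chip latency of layer i at bit-width b.
    budget:            maximum allowed total latency.

    Returns (config, cost), or (None, inf) if no configuration fits.
    Exhaustive enumeration is used here in place of an LP solver.
    """
    # Candidate bit-widths for each layer, taken from the sensitivity table.
    options = [sorted(layer) for layer in sensitivity]
    best_cfg, best_cost = None, float("inf")
    for cfg in product(*options):
        total_latency = sum(latency[i][b] for i, b in enumerate(cfg))
        if total_latency > budget:
            continue  # violates the hardware constraint
        cost = sum(sensitivity[i][b] for i, b in enumerate(cfg))
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


# Hypothetical three-layer network with 4-bit and 8-bit options per layer.
sensitivity = [{4: 0.9, 8: 0.1}, {4: 0.3, 8: 0.05}, {4: 0.2, 8: 0.02}]
latency = [{4: 1.0, 8: 2.0}, {4: 2.0, 8: 4.0}, {4: 1.5, 8: 3.0}]

config, cost = select_bit_widths(sensitivity, latency, budget=6.5)
print(config)  # (8, 4, 4): only the most sensitive layer keeps 8 bits
```

In this toy setting the first layer is by far the most sensitive, so it keeps 8 bits while the other two drop to 4 bits to satisfy the latency budget. A real deployment would replace the enumeration with a proper solver and the made-up tables with measured values.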

OHQ: On-chip Hardware-aware Quantization