Post-training quantization (PTQ) is an efficient model compression technique that quantizes a pretrained full-precision model using only a small calibration set of unlabeled samples, without retraining. PTQ methods for convolutional neural networks (CNNs) achieve accuracy comparable to that of their full-precision counterparts. Directly applying them to vision transformers (ViTs), however, incurs severe performance degradation, mainly due to the architectural differences between CNNs and ViTs. In particular, the distribution of activations for each channel varies drastically across input instances, making PTQ methods designed for CNNs inappropriate for ViTs. To address this, we introduce instance-aware group quantization for ViTs (IGQ-ViT). Specifically, we propose to split the channels of activation maps into multiple groups dynamically for each input instance, such that activations within each group share similar statistical properties. We also extend our scheme to quantize softmax attentions across tokens. In addition, the number of groups for each layer is adjusted to minimize the discrepancies between predictions from quantized and full-precision models, under a bit-operation (BOP) constraint. We show extensive experimental results on image classification, object detection, and instance segmentation, with various transformer architectures, demonstrating the effectiveness of our approach.
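
To make the idea of per-instance channel grouping concrete, the following is a minimal sketch, not the IGQ-ViT algorithm itself. It assumes a simple stand-in grouping rule (sort channels by their dynamic range and cut them into equally sized groups) and plain asymmetric uniform quantization per group. The function name `group_quantize` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def group_quantize(x, num_groups=4, bits=4):
    """Quantize the activations of ONE input instance with per-group ranges.

    x: array of shape (tokens, channels).
    Grouping rule (an assumption for illustration, not the paper's rule):
    channels are sorted by dynamic range and split into `num_groups` chunks.
    Returns the dequantized activations and the group index of each channel.
    """
    x = np.asarray(x, dtype=np.float64)
    channels = x.shape[1]
    levels = 2 ** bits - 1

    # Per-channel statistics are computed from this instance only,
    # so the grouping can change from one input to the next.
    ch_range = x.max(axis=0) - x.min(axis=0)
    order = np.argsort(ch_range)
    groups = np.array_split(order, num_groups)

    x_hat = np.empty_like(x)
    group_ids = np.empty(channels, dtype=np.int64)
    for g, idx in enumerate(groups):
        if idx.size == 0:  # more groups than channels
            continue
        lo = x[:, idx].min()
        hi = x[:, idx].max()
        scale = max(hi - lo, 1e-8) / levels
        # Uniform quantization with the range shared inside the group.
        q = np.clip(np.round((x[:, idx] - lo) / scale), 0, levels)
        x_hat[:, idx] = q * scale + lo
        group_ids[idx] = g
    return x_hat, group_ids
```

Calling `group_quantize(x, num_groups=1)` reduces to a single layer-wise quantizer, which gives a convenient baseline for comparing reconstruction error against larger group counts on the same instance.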

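Post-training quantization (PTQ) is an efficient model compression technique that quantizes a pretrained full-precision model using a small calibration set, without retraining. We propose instance-aware group quantization for vision transformers (IGQ-ViT), which dynamically splits the channels of activation maps into multiple groups so that the activations within each group share similar statistical properties. Our method extends to quantizing softmax attentions across tokens, and it adjusts the number of groups for each layer to minimize the discrepancy between the quantized and full-precision models, achieving good results under a bit-operation constraint. Extensive experiments on image classification, object detection, and instance segmentation demonstrate the effectiveness of our method.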

Instance-Aware Group Quantization for Vision Transformers