Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, which makes deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity; unfortunately, existing efforts to quantize ViTs rely on simulated quantization (also known as fake quantization), which still performs floating-point arithmetic during inference and thus contributes little to model acceleration. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs that allows the entire computational graph of inference to be executed with integer operations and bit-shifting, with no floating-point operations. In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only pipeline with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed light-weight integer-only arithmetic methods. In particular, I-ViT applies the proposed Shiftmax and ShiftGELU, which use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models, and the results show that integer-only INT8 quantization achieves accuracy comparable to (or even higher than) the full-precision (FP) baseline. Furthermore, we use TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving a 3.72 to 4.11$\times$ inference speedup compared to the FP model.
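To make the "integer-only pipeline with dyadic arithmetic" concrete, the following is a minimal Python sketch of dyadic requantization as it is commonly done in integer-only inference. The function names `to_dyadic` and `requantize`, the 31-bit multiplier width, and the example scale are illustrative assumptions and are not taken from the paper. The idea is that a real-valued rescaling factor is approximated offline by $b / 2^{c}$ with integers $b$ and $c$, so that at inference time the rescaling needs only an integer multiply and a right shift.

```python
import math
from typing import Tuple


def to_dyadic(scale: float, bits: int = 31) -> Tuple[int, int]:
    """Approximate a positive scale as b / 2**c (done offline, floats allowed here)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    # frexp gives scale = mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(scale)
    b = round(mantissa * (1 << bits))
    c = bits - exponent
    if c < 1:
        raise ValueError("scale too large for this sketch")
    return b, c


def requantize(acc: int, b: int, c: int) -> int:
    """Rescale an integer accumulator to INT8 with integer ops only.

    Computes round(acc * b / 2**c) via multiply, add, and right shift,
    then clamps to the signed 8-bit range.
    """
    rounded = (acc * b + (1 << (c - 1))) >> c
    return max(-128, min(127, rounded))


# Hypothetical combined scale s_x * s_w / s_y for one layer (assumed value).
b, c = to_dyadic(0.0125)
print(requantize(4000, b, c))    # accumulator 4000 scaled by roughly 0.0125
print(requantize(100000, b, c))  # large accumulator, clamped to the INT8 range
```

Only `to_dyadic` touches floating point, and it would run once before deployment; `requantize` is the part that would sit on the inference path.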

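This paper proposes I-ViT, an integer-only quantization scheme for Vision Transformers. It completes the entire computational graph with integer arithmetic and bit-shifting, without any floating-point arithmetic, and approximates the non-linear components with methods such as Shiftmax and ShiftGELU, reducing model complexity and improving efficiency on edge devices. Experimental results show that integer-only quantization achieves accuracy comparable to (or even slightly higher than) the FP baseline, and deployment with TVM on the GPU's integer arithmetic units yields a 3.72 to 4.11$\times$ inference speedup.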
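As an illustration of how bit-shifting can stand in for a floating-point exponential inside a softmax, here is a small Python sketch. It is written in the spirit of Shiftmax but is not claimed to be the paper's exact formulation: the fixed-point format (`frac_bits`), the approximation $\log_2 e \approx 1 + 1/2 - 1/16$, the linear approximation $2^{r} \approx 1 + r/2$ on $(-1, 0]$, and the names `shift_exp` and `shift_softmax` are all assumptions made for this example.

```python
def shift_exp(x_q: int, frac_bits: int = 10) -> int:
    """Approximate exp(x) for x <= 0 using integer adds and shifts.

    x_q is fixed-point (x = x_q / 2**frac_bits); the result uses the same format.
    """
    if x_q > 0:
        raise ValueError("expects a non-positive input (subtract the max first)")
    # exp(x) = 2**(x * log2(e)), with log2(e) ~= 1.4375 = 1 + 1/2 - 1/16
    t = x_q + (x_q >> 1) - (x_q >> 4)
    # Split t into an integer part -q and a fractional part r in (-1, 0]
    q = (-t) >> frac_bits
    r = t + (q << frac_bits)
    # 2**r ~= 1 + r/2 on (-1, 0]; the factor 2**(-q) is a right shift by q
    one = 1 << frac_bits
    return (one + (r >> 1)) >> q


def shift_softmax(logits_q, frac_bits: int = 10, out_bits: int = 8):
    """Integer softmax over fixed-point logits; outputs are scaled by 2**out_bits."""
    m = max(logits_q)
    exps = [shift_exp(v - m, frac_bits) for v in logits_q]
    total = sum(exps)  # at least 2**frac_bits, since the max element maps to 1.0
    return [(e << out_bits) // total for e in exps]


print(shift_exp(-1024))                # fixed-point approximation of exp(-1)
print(shift_softmax([2048, 1024, 0]))  # logits 2.0, 1.0, 0.0 in fixed point
```

With `frac_bits = 10`, `shift_exp(-1024)` returns 400, i.e. about 0.39, against the true value $e^{-1} \approx 0.368$. The error comes mostly from the linear approximation of $2^{r}$, which this sketch makes no attempt to refine.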

I-ViT: Integer Quantization for Optimized Vision Transformer Inference