The transformer has extended its success from the language domain to the
vision domain. Because of the stacked self-attention and cross-attention
blocks, accelerating the deployment of vision transformers on GPU hardware is
challenging and has rarely been studied. This paper designs a compression
scheme that makes maximal use of GPU-friendly 2:4 fine-grained structured
sparsity and quantization. Specifically, an original large model with dense
weight parameters is first pruned into a sparse one by 2:4 structured pruning,
which exploits the GPU's acceleration of the 2:4 structured sparse pattern
with the FP16 data type. The floating-point sparse model is then quantized
into a fixed-point one by sparse-distillation-aware quantization-aware
training, which exploits the extra speedup the GPU provides for 2:4 sparse
computation with integer tensors. A mixed-strategy knowledge distillation is
used during the pruning and quantization process. The proposed compression
scheme is flexible enough to support both supervised and unsupervised learning
styles. Experimental results show that the GPUSQ-ViT scheme achieves
state-of-the-art compression, reducing vision transformer models by 6.4-12.7
times in model size and 30.3-62 times in FLOPs with negligible accuracy
degradation on the ImageNet classification, COCO detection, and ADE20K
segmentation benchmarks. Moreover, GPUSQ-ViT improves actual deployment
latency and throughput by 1.39-1.79 times and 3.22-3.43 times, respectively,
on the A100 GPU, and by 1.57-1.69 times and 2.11-2.51 times, respectively, on
AGX Orin.
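
To make the 2:4 pattern and the fixed-point step concrete, the following is a
minimal sketch, not the paper's method. It assumes one-shot magnitude-based
selection (keep the two largest-magnitude values in every group of four) and
symmetric per-tensor INT8 rounding, whereas the abstract describes pruning and
quantization-aware training guided by knowledge distillation. The function
names `prune_2_4` and `quantize_int8` are hypothetical.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in every group of four.

    Assumption: one-shot magnitude selection, used here only to
    illustrate the 2:4 pattern (at most 2 nonzeros per 4 weights).
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def quantize_int8(values):
    """Symmetric per-tensor quantization to the signed range [-127, 127].

    Assumption: plain post-training rounding, not quantization-aware
    training. Zeros map to zero, so the 2:4 pattern is preserved.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse = prune_2_4(weights)        # dense FP weights -> 2:4 sparse weights
q, scale = quantize_int8(sparse)   # sparse FP weights -> sparse INT8 weights
print(sparse)
print(q, scale)
```

Because the quantizer is symmetric with no zero-point offset, pruned weights
stay exactly zero after quantization, which is what lets the sparse pattern
carry over from the floating-point model to the integer one.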

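By applying 2:4 structured sparsity and quantization, and using mixed-strategy knowledge distillation during pruning and quantization, this paper designs a compression scheme that reduces the size of vision transformer models by 6.4-12.7 times with almost no loss of accuracy and improves actual deployment performance.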

Boost Vision Transformer with GPU-Friendly Sparsity and Quantization

Transformer-based pre-trained language models have significantly improved the
performance of various natural language processing (NLP) tasks in recent
years. While effective and prevalent, these models are usually prohibitively
large for resource-limited deployment scenarios. A thread of research has thus
been working on applying network pruning techniques under the
pretrain-then-finetune paradigm widely adopted in NLP. However, the existing
pruning results on benchmark transformers, such as BERT, are not as remarkable
as the pruning results in the literature of convolutional neural networks
(CNNs). In particular, common wisdom in CNN pruning holds that sparse pruning
compresses a model more than reducing the number of channels and layers does
(Elsen et al., 2020; Zhu and Gupta, 2017), while existing work on sparse
pruning of BERT yields results inferior to small-dense counterparts such as
TinyBERT (Jiao et al., 2020). In this work, we aim to fill this gap by studying
how knowledge is transferred and lost during the pre-training, fine-tuning,
and pruning processes, and by proposing a knowledge-aware sparse pruning
process that achieves significantly better results than the existing
literature. We show for the first time that sparse pruning compresses a BERT
model significantly more than reducing its number of channels and layers.
Experiments on multiple datasets of the GLUE benchmark show that our method
outperforms the leading competitors with 20-times weight/FLOPs compression and
negligible loss in prediction accuracy.
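
As a rough illustration of what 20-times weight compression means for
unstructured sparse pruning, the sketch below keeps only the
largest-magnitude 1/20 of the weights and zeroes the rest. This is plain
magnitude pruning, swapped in as a simple stand-in. It is not the
knowledge-aware procedure the abstract proposes, and the function name
`magnitude_prune` and the random weights are assumptions made for the example.

```python
import random


def magnitude_prune(weights, compression):
    """Keep the largest-magnitude 1/compression of the weights.

    Assumption: plain global magnitude pruning, used only to show
    the sparsity level implied by a given compression ratio.
    """
    if compression < 1:
        raise ValueError("compression must be at least 1")
    n_keep = max(1, round(len(weights) / compression))
    # Rank weight indices from largest to smallest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]),
                   reverse=True)
    keep = set(order[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


rng = random.Random(0)  # fixed seed so the example is repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in weights
pruned = magnitude_prune(weights, compression=20)
nonzero = sum(1 for w in pruned if w != 0.0)
print(nonzero, len(weights) / nonzero)
```

The ratio counts nonzero weights only. A real sparse format also stores
index metadata, so the on-disk saving is somewhat smaller than the weight
count suggests.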

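This paper studies sparse pruning of pre-trained Transformer models in NLP and finds that it is more effective than compressing their channels and layers. Experimental comparisons on the GLUE datasets show that the knowledge-aware sparse pruning method adopted in this paper achieves a 20-times compression of parameters/FLOPs without noticeably degrading model performance.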