Token pruning has emerged as an effective solution to speed up the inference of large Transformer models. However, prior work on accelerating Vision Transformer (ViT) models requires training from scratch or fine-tuning with additional parameters, which prevents a simple plug-and-play. To avoid high training costs during the deployment stage, we present a fast training-free compression framework enabled by (i) a dense feature extractor in the initial layers; (ii) a sharpness-minimized model which is more compressible; and (iii) a local-global token merger that can exploit spatial relationships at various contexts. We applied our framework to various ViT and DeiT models and achieved up to 2x reduction in FLOPS and 1.8x speedup in inference throughput with <1% accuracy loss, while saving two orders of magnitude shorter training times than existing approaches. Code will be available at https://github.com/johnheo/fast-compress-vit

提出优化Transformer模型(ViT)部署过程中训练代价高的问题的快速无需训练压缩框架，其中包括初层的稠密特征提取器、压缩率更高的模型和利用空间关系的局部-全局令牌合并方法，在多个模型上实现了至多2倍的FLOPS减少和1.8倍的推理吞吐量提升，训练时间比现有方法节省两个数量级。

一种用于Vision Transformer的快速无需训练的压缩框架