The computational complexity, quadratic in the number of tokens, limits the
practical applications of Vision Transformers (ViTs). Several works propose to
prune redundant tokens to achieve efficient ViTs. However, these methods
generally suffer from (i) dramatic accuracy drops, (ii) difficulty of application
to local vision transformers, and (iii) networks that are not general-purpose for
downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT)
for efficient global and local vision transformers, which can also be revised
to serve as a backbone for downstream tasks. The semantic tokens represent
cluster centers, and they are initialized by pooling image tokens in space and
recovered by attention, which can adaptively represent global or local semantic
information. Because they act as cluster centers, a few semantic tokens can attain
the same effect as a vast number of image tokens, for both global and local vision
transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base)
can achieve the same accuracy with more than 100% inference speed improvement
and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16
semantic tokens in each window to further speed them up by around 20% with a slight
accuracy increase. Besides image classification, we also
extend our method to video recognition. In addition, we design an
STViT-R(ecover) network that restores detailed spatial information on top of
STViT, making it work for downstream tasks, which previous
token sparsification methods cannot handle. Experiments demonstrate that our method can
achieve competitive results compared to the original networks in object
detection and instance segmentation, with over 30% FLOPs reduction for the
backbone. Code is available at this http URL

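Summary: This paper introduces a ViT model based on semantic tokens that can be used for image classification as well as for tasks such as object detection and instance segmentation. By applying attention to image tokens pooled in space, it replaces a large number of image tokens with a few semantic tokens, making the network both smaller and more efficient.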
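
As a rough illustration of the mechanism described above (semantic tokens initialized by pooling image tokens in space, then refined by attention), the following NumPy sketch may help. It is not the authors' implementation: the function names, the 12x12 token grid, the 32-dimensional embeddings, the plain average pooling, and the single-head attention without learned projections are all assumptions chosen to keep the example short.

```python
import numpy as np

def init_semantic_tokens(image_tokens, grid, out):
    """Average-pool an (H*W, D) token grid down to (out*out, D) tokens.

    Assumption: H and W are divisible by `out` (e.g. 12x12 pooled to 4x4).
    """
    h, w = grid
    d = image_tokens.shape[1]
    x = image_tokens.reshape(h, w, d)
    # Split each spatial axis into (block index, offset inside block),
    # then average over the offsets to get one token per block.
    x = x.reshape(out, h // out, out, w // out, d).mean(axis=(1, 3))
    return x.reshape(out * out, d)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def refine_semantic_tokens(semantic, image_tokens):
    """One cross-attention step: semantic tokens query the image tokens.

    Hypothetical simplification: single head, no learned Q/K/V projections.
    """
    d = semantic.shape[1]
    scores = semantic @ image_tokens.T / np.sqrt(d)  # (S, N) similarities
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ image_tokens, weights

rng = np.random.default_rng(0)
image_tokens = rng.standard_normal((12 * 12, 32))   # 144 toy image tokens
semantic = init_semantic_tokens(image_tokens, (12, 12), 4)
semantic, weights = refine_semantic_tokens(semantic, image_tokens)
print(semantic.shape, weights.shape)
```

In this toy setting, later self-attention layers would compare 16 x 16 = 256 token pairs instead of 144 x 144 = 20736, which is where the quadratic savings come from. These counts are only illustrative and are not the FLOPs figures reported in the abstract.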