Self-attention-based vision transformers (ViTs) have emerged as a highly
competitive architecture in computer vision. Unlike convolutional neural
networks (CNNs), ViTs are capable of global information sharing. With the
development of various structures of ViTs, ViTs are increasingly advantageous
for many vision tasks. However, the quadratic complexity of self-attention
renders ViTs computationally intensive, and their lack of inductive biases of
locality and translation equivariance demands larger model sizes compared to
CNNs to effectively learn visual features. In this paper, we propose a
light-weight and efficient vision transformer model called DualToken-ViT that
leverages the advantages of CNNs and ViTs. DualToken-ViT effectively fuses the
token with local information obtained by convolution-based structure and the
token with global information obtained by self-attention-based structure to
achieve an efficient attention structure. In addition, we use position-aware
global tokens throughout all stages to enrich the global information, which
further strengthening the effect of DualToken-ViT. Position-aware global tokens
also contain the position information of the image, which makes our model
better for vision tasks. We conducted extensive experiments on image
classification, object detection and semantic segmentation tasks to demonstrate
the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of
different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G
FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T using
global tokens by 0.7%.

提出了一种轻量级和高效的视觉变换模型 DualToken-ViT，它通过卷积和自注意结构有效地融合了局部信息和全局信息以实现高效的注意力结构，并使用位置感知的全局标记来丰富全局信息，并改进了图像的位置信息，通过在图像分类、物体检测和语义分割任务上进行广泛实验，展示了 DualToken-ViT 的有效性，其在 ImageNet-1K 数据集上取得了 75.4% 和 79.4% 的准确率，而在只有 0.5G 和 1.0G 的 FLOPs 下，我们的 1.0G FLOPs 的模型的性能超过了使用全局标记的 LightViT-T 模型 0.7%。