Vision transformers have shown excellent performance in computer vision tasks. However, their (local) self-attention mechanism is computationally expensive. By comparison, CNNs are more efficient thanks to their built-in inductive bias. Recent works show that CNNs can compete with vision transformers by adopting their architecture design and training protocols. Nevertheless, existing methods either ignore multi-level features or lack dynamic properties, leading to sub-optimal performance. In this paper, we propose a novel attention mechanism named MCA, which captures different patterns of input images using multiple kernel sizes and enables input-adaptive weights with a gating mechanism. Based on MCA, we present a neural network named ConvFormer. ConvFormer adopts the general architecture of vision transformers, while replacing the (local) self-attention mechanism with our proposed MCA. Extensive experimental results demonstrate that ConvFormer outperforms similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs) in various tasks. For example, ConvFormer-S and ConvFormer-L achieve state-of-the-art top-1 accuracies of 82.8% and 83.6%, respectively, on the ImageNet dataset. Moreover, ConvFormer-S outperforms Swin-T by 1.5 mIoU on ADE20K and by 0.9 bounding box AP on COCO with a smaller model size. Code and models will be available.
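
To make the idea of multiple kernel sizes combined with a gating mechanism concrete, the following is a minimal single-channel sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how the branches are fused or how the gate is computed, so the averaging of branches, the sigmoid gate, the fixed box kernels (a real model would learn its kernels), and all function names here are hypothetical.

```python
import math


def conv2d_same(image, kernel):
    """Zero-padded 'same' cross-correlation of one channel with an odd-sized square kernel."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # pixels outside the image count as zero
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def box_kernel(size):
    """Hypothetical stand-in for a learned kernel: a normalized box filter."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


def multi_kernel_gated_attention(image, kernels):
    """Assumed fusion scheme: average the per-kernel responses, squash them
    with a sigmoid to get input-adaptive weights, and gate the input with them."""
    h, w = len(image), len(image[0])
    branches = [conv2d_same(image, k) for k in kernels]  # one branch per kernel size
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            multi_scale = sum(b[i][j] for b in branches) / len(branches)
            gate = sigmoid(multi_scale)      # weight depends on the input itself
            out[i][j] = gate * image[i][j]   # gate the original feature
    return out


# Usage: gate a small 5x5 feature map with 3x3 and 5x5 branches.
feature_map = [[float((r + c) % 3) for c in range(5)] for r in range(5)]
gated = multi_kernel_gated_attention(feature_map, [box_kernel(3), box_kernel(5)])
```

Because the gate is computed from the input rather than fixed after training, the effective weight applied at each position changes from image to image, which is the sense of "input-adaptive" used in this sketch.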

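This paper proposes a dynamic multi-level attention mechanism (DMA), which captures different patterns of input images using multiple kernel sizes and enables input-adaptive weights through a gating mechanism. Building on DMA, it then presents an effective backbone network named DMFormer, which replaces the self-attention mechanism of vision transformers with DMA. Extensive experimental results on the ImageNet-1K and ADE20K datasets show that DMFormer achieves state-of-the-art performance, outperforming similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs).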

DMFormer: Closing the Gap Between CNN and Vision Transformers