This paper introduces a novel attention mechanism, called dual attention,
which is both efficient and effective. The dual attention mechanism consists of
two parallel components: local attention generated by Convolutional Neural
Networks (CNNs) and long-range attention generated by Vision Transformers
(ViTs). To reduce the high computational complexity and memory footprint of
vanilla Multi-Head Self-Attention (MHSA), we propose Multi-Head
Partition-wise Attention (MHPA), which models intra-partition and
inter-partition attention simultaneously. Building on the dual attention block and partition-wise
attention mechanism, we present a hierarchical vision backbone called
DualFormer. We evaluate the effectiveness of our model on several computer
vision tasks, including image classification on ImageNet, object detection on
COCO, and semantic segmentation on Cityscapes. Specifically, the proposed
DualFormer-XS achieves 81.5\% top-1 accuracy on ImageNet, outperforming the
recent state-of-the-art MPViT-XS by 0.6 percentage points with much higher
throughput.
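
The abstract does not say how partitions are formed or how the intra-partition and inter-partition terms are combined, so the following is only an illustrative NumPy sketch of the general idea, not the DualFormer implementation. It assumes contiguous, equal-sized partitions, a single head without learned projections, mean-pooled partition summaries for the inter-partition term, and a plain sum of the two outputs. The function names `attention` and `partition_wise_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    scale = q.shape[-1] ** -0.5
    weights = softmax(q @ np.swapaxes(k, -1, -2) * scale, axis=-1)
    return weights @ v


def partition_wise_attention(x, num_partitions):
    """Illustrative partition-wise attention on tokens x of shape (n, d).

    Assumptions (not taken from the paper): contiguous equal-sized
    partitions, mean-pooled partition summaries, and summation of the
    intra-partition and inter-partition outputs.
    """
    n, d = x.shape
    assert n % num_partitions == 0, "n must be divisible by num_partitions"
    parts = x.reshape(num_partitions, n // num_partitions, d)

    # Intra-partition: each token attends only to tokens in its own partition.
    intra = attention(parts, parts, parts).reshape(n, d)

    # Inter-partition: each token attends to one summary vector per partition.
    summaries = parts.mean(axis=1)              # shape (num_partitions, d)
    inter = attention(x, summaries, summaries)  # shape (n, d)

    return intra + inter


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
out = partition_wise_attention(tokens, num_partitions=4)
```

Under these assumptions, the sketch computes n²/P + nP attention scores for n tokens and P partitions, compared with n² for full self-attention. For n = 16 and P = 4 that is 128 scores instead of 256.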

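This paper introduces a novel dual attention mechanism, comprising local attention generated by convolutional neural networks and long-range attention generated by Vision Transformers. It proposes a new Multi-Head Partition-wise Attention (MHPA) mechanism to address computational complexity and memory footprint and, building on it, presents a hierarchical vision backbone, DualFormer, which achieves fairly good performance on multiple computer vision tasks.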