Recent studies have demonstrated that attention-based networks, such as Vision Transformer (ViT), can outperform Convolutional Neural Networks (CNNs) on several computer vision tasks without using convolutional layers. This naturally leads to the following question: can a self-attention layer of ViT express any convolution operation? In this work, we prove constructively that a single ViT layer with image patches as input can perform any convolution operation, where the multi-head attention mechanism and the relative positional encoding play essential roles. We further provide a lower bound on the number of heads required for Vision Transformers to express CNNs. Consistent with our analysis, experimental results show that the construction in our proof can help inject convolutional bias into Transformers and significantly improve the performance of ViT in low-data regimes.

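This paper asks whether the Vision Transformer (ViT), which is built on self-attention, can express convolution operations. It proves that a single ViT layer taking image patches as input can be constructed to perform any convolution operation, with the multi-head attention mechanism and the relative positional encoding playing key roles. The authors also give a lower bound on the number of heads a Vision Transformer needs in order to express a CNN. The construction used in the proof can help inject convolutional bias into Transformers and significantly improves ViT performance in low-data regimes.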
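To make the claim concrete, here is a minimal NumPy sketch of the general idea that attention heads driven purely by relative position can reproduce a convolution. It is a simplified illustration under stated assumptions, not the paper's construction: it works on individual pixels rather than image patches, it uses hard (one-hot) attention as a stand-in for a softmax saturated by the relative positional encoding, and it spends one head per kernel offset (K×K heads in total). The function names `conv2d_same` and `attention_as_conv` are hypothetical.

```python
import numpy as np


def conv2d_same(x, w):
    """Reference: zero-padded 'same' convolution (cross-correlation form).

    x: (H, W, C_in) image, w: (K, K, C_in, C_out) kernel with K odd.
    """
    H, W, _ = x.shape
    K = w.shape[0]
    r = K // 2
    xp = np.pad(x, ((r, r), (r, r), (0, 0)))
    out = np.zeros((H, W, w.shape[3]))
    for i in range(K):
        for j in range(K):
            # Shifted view of the padded input times the kernel slice.
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out


def attention_as_conv(x, w):
    """Same result, computed with K*K position-only attention heads.

    Assumption: head (i, j) places all of its attention weight on the key
    located at relative offset (i - r, j - r) from the query. The combined
    value/output projection of that head is the kernel slice w[i, j].
    """
    H, W, C_in = x.shape
    K = w.shape[0]
    r = K // 2
    tokens = x.reshape(H * W, C_in)               # one token per pixel
    rows, cols = np.divmod(np.arange(H * W), W)   # 2-D position of each token
    out = np.zeros((H * W, w.shape[3]))
    for i in range(K):
        for j in range(K):
            di, dj = i - r, j - r
            # Hard attention matrix: A[q, k] = 1 iff key k sits at offset
            # (di, dj) from query q. Rows whose target falls outside the
            # image stay all-zero, which mimics zero padding.
            A = ((rows[None, :] - rows[:, None] == di)
                 & (cols[None, :] - cols[:, None] == dj)).astype(x.dtype)
            out += A @ tokens @ w[i, j]
    return out.reshape(H, W, -1)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6, 3))
w = rng.normal(size=(3, 3, 3, 4))
same = np.allclose(conv2d_same(x, w), attention_as_conv(x, w))
```

Two simplifications are worth keeping in mind. A real softmax cannot emit an all-zero row, so the border handling here is an idealization. The sketch also needs one head per kernel offset, whereas the paper works with patch inputs and studies how many heads are actually required, which is where its lower bound comes in.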

Can Vision Transformers Perform Convolution?