There is an ongoing debate over whether vision Transformers or ConvNets make the better backbone for computer vision models. Although the two are usually regarded as completely different architectures, in this paper we interpret vision Transformers as ConvNets with dynamic convolutions. This lets us characterize existing Transformers and dynamic ConvNets in a unified framework and compare their design choices side by side. The interpretation can also guide network design, since researchers can now consider vision Transformers from the design space of ConvNets and vice versa. We demonstrate this potential through two studies. First, we examine the role of softmax as the activation function in vision Transformers and find that it can be replaced by modules commonly used in ConvNets, such as ReLU and Layer Normalization, which yields faster convergence and better performance. Second, following the design of depth-wise convolution, we create a corresponding depth-wise vision Transformer that is more efficient while achieving comparable performance. The potential of the proposed unified interpretation is not limited to these examples, and we hope it inspires the community and gives rise to more advanced network architectures.
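
To make the central reading concrete, the following is a minimal NumPy sketch, not the paper's own formulation. It assumes single-head self-attention over a flattened sequence of tokens and shows that the usual matrix form gives the same output as a per-position "dynamic convolution", in which each output position applies a kernel generated from the input itself. The function names (`self_attention`, `dynamic_conv_view`), the weight matrices, and the shapes are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Standard single-head self-attention in matrix form.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def dynamic_conv_view(x, w_q, w_k, w_v):
    # Same computation, written position by position: for each output
    # position i, a kernel spanning all tokens is generated from the
    # input (hence "dynamic") and applied to the value features.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    out = np.zeros_like(v)
    for i in range(x.shape[0]):
        kernel = softmax(k @ q[i] / np.sqrt(q.shape[-1]))  # shape (n_tokens,)
        out[i] = kernel @ v  # weighted sum of value features
    return out

# Illustrative shapes (assumptions): 6 tokens, 8 input dims, 4 head dims.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 4)) for _ in range(3))

attn_out = self_attention(x, w_q, w_k, w_v)
conv_out = dynamic_conv_view(x, w_q, w_k, w_v)
print(np.allclose(attn_out, conv_out))
```

In this view the kernel differs from a static convolution kernel in two ways: its weights depend on the input, and its support covers every token rather than a fixed local window.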

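We interpret vision Transformers as ConvNets with dynamic convolutions and compare their design choices in a unified framework, showing that vision Transformers can draw on the design space of ConvNets to guide network design. We show how replacing the activation function improves performance and convergence speed, and how a depth-wise vision Transformer improves efficiency. The unified interpretation is not limited to the given examples, and we hope it inspires the community and leads to more advanced network architectures.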
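
The abstract's first study, replacing softmax with ReLU and Layer Normalization, can be illustrated with a rough sketch. The abstract does not say where these modules are placed, so the arrangement below is an assumption rather than the paper's method: ReLU is applied to the scaled attention scores, and a parameter-free layer normalization is applied to the aggregated output. The names `relu_attention` and `layer_norm` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free layer normalization over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu_attention(x, w_q, w_k, w_v):
    # ASSUMED placement: ReLU replaces softmax on the scores, and layer
    # normalization rescales the output, since ReLU weights are not
    # normalized to sum to one as softmax weights are.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.maximum(scores, 0.0)  # ReLU in place of softmax
    return layer_norm(weights @ v)

# Illustrative shapes (assumptions): 6 tokens, 8 input dims, 4 head dims.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 4)) for _ in range(3))

relu_out = relu_attention(x, w_q, w_k, w_v)
print(relu_out.shape)
```

The sketch only shows that the substitution is mechanically simple. It makes no claim about the convergence or accuracy results reported in the abstract.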

Interpreting Vision Transformers as ConvNets with Dynamic Convolutions