The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion. In experiments on ImageNet, this architecture performs surprisingly well: a ViT/DeiT-base-sized model obtains 74.9\% top-1 accuracy, compared to 77.9\% and 79.9\% for ViT and DeiT respectively. These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought. We hope these results prompt the community to spend more time trying to understand why our current models are as effective as they are.
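The architecture described above can be illustrated with a minimal, dependency-free sketch of one block: a feed-forward layer applied over the patch dimension (in place of attention), followed by one applied over the feature dimension. The abstract does not give implementation details, so the residual connections, the parameter-free layer normalization, the GELU activation, the hidden sizes, and all function names here (`feed_forward`, `block`, `init_ff`) are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def gelu(x):
    # tanh approximation of GELU (the choice of activation is an assumption)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def transpose(m):
    return [list(row) for row in zip(*m)]

def linear(x, w, b):
    # x: (n, d_in), w: (d_in, d_out), b: (d_out,) -> (n, d_out)
    return [[sum(xi * wi for xi, wi in zip(row, col)) + bj
             for col, bj in zip(zip(*w), b)] for row in x]

def feed_forward(x, params):
    # two-layer MLP applied independently to each row of x
    w1, b1, w2, b2 = params
    hidden = [[gelu(v) for v in row] for row in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

def layer_norm(x, eps=1e-6):
    # normalize each row to zero mean and unit variance (no learned scale/shift)
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def init_ff(rng, dim, hidden):
    # hypothetical initializer: small Gaussian weights, zero biases
    w1 = [[rng.gauss(0.0, 0.02) for _ in range(hidden)] for _ in range(dim)]
    w2 = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
    return w1, [0.0] * hidden, w2, [0.0] * dim

def block(x, patch_ff, feature_ff):
    # x: (num_patches, dim)
    # 1) feed-forward over the patch dimension: transpose so that each row
    #    holds one feature across all patches, mix, then transpose back
    mixed = transpose(feed_forward(transpose(layer_norm(x)), patch_ff))
    x = add(x, mixed)
    # 2) feed-forward over the feature dimension, applied to each patch
    return add(x, feed_forward(layer_norm(x), feature_ff))

rng = random.Random(0)
num_patches, dim = 4, 6  # toy sizes, far smaller than a real model
x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]
patch_ff = init_ff(rng, num_patches, 8)
feature_ff = init_ff(rng, dim, 12)
out = block(x, patch_ff, feature_ff)
print(len(out), len(out[0]))  # the block preserves the (num_patches, dim) shape
```

Stacking several such blocks on top of a patch embedding gives the alternating pattern the abstract describes. The transpose is the only step that lets information flow between patches, which is the role attention would otherwise play.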

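By replacing the attention layer in a Vision Transformer with a feed-forward network applied over the patch dimension, this paper finds that aspects of the transformer other than attention, such as the patch embedding, may matter more than previously thought. In ImageNet experiments, the new architecture performs surprisingly well, reaching 74.9% top-1 accuracy.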

Do you really need attention? A stack of feed-forward layers alone does surprisingly well on ImageNet.