The recent success of Vision Transformers is shaking the decade-long dominance
of Convolutional Neural Networks (CNNs) in image recognition.
Specifically, in terms of robustness on out-of-distribution samples, recent
research finds that Transformers are inherently more robust than CNNs,
regardless of different training setups. Moreover, it is believed that such
superiority of Transformers should largely be credited to their
self-attention-like architectures per se. In this paper, we question that
belief by closely examining the design of Transformers. Our findings lead to
three architecture designs that are highly effective at boosting robustness yet
simple enough to be implemented in several lines of code: a) patchifying input
images, b) enlarging the kernel size, and c) reducing the number of activation
layers and normalization layers. Bringing these components together, we are able to build
pure CNN architectures without any attention-like operations that are as robust
as, or even more robust than, Transformers. We hope this work can help the
community better understand the design of robust neural architectures. The code
is publicly available at this https URL
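
As a rough illustration of design (a), the sketch below splits an image into
non-overlapping patches and applies one shared linear projection to each patch,
which is what a convolution with kernel size equal to stride computes. It is a
minimal pure-Python sketch, not the authors' code: the names `patchify` and
`project`, the single-channel nested-list image, and the weight layout are all
assumptions made for this example. Designs (b) and (c) are not covered.

```python
from typing import List

Image = List[List[float]]  # assumed: single-channel H x W image as nested lists


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles.

    Each tile is flattened row-major; tiles are returned in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ])
    return tokens


def project(tokens: List[List[float]], weights: List[List[float]]) -> List[List[float]]:
    """Apply the same linear map to every flattened patch.

    `weights` has one row per output channel, each of length patch * patch.
    Together with `patchify`, this matches a bias-free convolution whose
    kernel size and stride both equal the patch size.
    """
    return [
        [sum(t * w for t, w in zip(token, row)) for row in weights]
        for token in tokens
    ]
```

For a 4 x 4 image and `patch=2`, `patchify` returns four tokens of length four,
and `project` maps each token to as many values as `weights` has rows.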

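By closely examining the design of Transformers, this paper finds that architectures built from convolutional neural networks (CNNs) can be just as effective at improving robustness. Specifically, our findings are three designs: a) patchifying input images, b) enlarging the convolution kernel size, and c) reducing activation layers and normalization layers. Our experimental results show that combining these three designs yields CNN architectures that are simple to implement, need no attention-like operations, and are as robust as, or even more robust than, Transformers.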

Can CNNs Be More Robust Than Transformers?

Transformers have emerged as a powerful tool for visual recognition. In addition to
demonstrating competitive performance on a broad range of visual benchmarks,
recent works also argue that Transformers are much more robust than
Convolutional Neural Networks (CNNs). Nonetheless, surprisingly, we find these
conclusions are drawn from unfair experimental settings, where Transformers and
CNNs are compared at different scales and are applied with distinct training
frameworks. In this paper, we aim to provide the first fair and in-depth
comparisons between Transformers and CNNs, focusing on robustness evaluations.
With our unified training setup, we first challenge the previous belief that
Transformers outshine CNNs when measuring adversarial robustness. More
surprisingly, we find CNNs can easily be as robust as Transformers in defending
against adversarial attacks, if they properly adopt Transformers' training
recipes. Regarding generalization on out-of-distribution samples, we show that
pre-training on (external) large-scale datasets is not a fundamental requirement
for enabling Transformers to achieve better performance than CNNs. Moreover,
our ablations suggest that such stronger generalization stems largely from the
Transformer's self-attention-like architectures per se, rather than from other
training setups. We hope this work can help the community better understand and
benchmark the robustness of Transformers and CNNs. The code and models are
publicly available at this https URL
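
To make "measuring adversarial robustness" concrete, the sketch below compares
accuracy on clean inputs with accuracy on inputs perturbed by a one-step
attack. The abstract does not name an attack or an evaluation protocol, so
everything here is an assumption for illustration: the attack is the fast
gradient sign method (FGSM), a toy logistic-regression classifier stands in for
a Transformer or CNN, and the names `predict`, `fgsm`, and `accuracy` are
hypothetical.

```python
import math
from typing import List


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(value: float) -> int:
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm(w: List[float], b: float, x: List[float], y: int, eps: float) -> List[float]:
    """One-step FGSM: move x by eps along the sign of the loss gradient.

    For binary cross-entropy on a logistic model, dLoss/dx = (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def accuracy(w: List[float], b: float, xs: List[List[float]], ys: List[int]) -> float:
    """Fraction of inputs whose thresholded prediction matches the label."""
    correct = sum((predict(w, b, x) >= 0.5) == bool(y) for x, y in zip(xs, ys))
    return correct / len(xs)
```

Under this setup, robust accuracy is `accuracy` evaluated on
`[fgsm(w, b, x, y, eps) for x, y in zip(xs, ys)]`, reported next to the clean
accuracy on `xs` for the same model and the same `eps` across all models being
compared.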

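This paper provides the first fair and in-depth comparison between Transformers and CNNs, focusing on robustness evaluation, and shows that CNNs can defend against adversarial attacks as effectively as Transformers. We also find that the stronger generalization stems mainly from the Transformer's self-attention-like architecture rather than from other training setups.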