The success of Vision Transformer (ViT) has been reported on a wide
range of image recognition tasks. The merit of ViT over CNN has been largely
attributed to large training datasets or auxiliary pre-training. Without
pre-training, the performance of ViT on small datasets is limited because the
global self-attention has limited capacity in local modeling. Towards boosting
ViT on small datasets without pre-training, this work improves its local
modeling by applying a weight mask on the original self-attention matrix. A
straightforward way to locally adapt the self-attention matrix can be realized
by an element-wise learnable weight mask (ELM), for which our preliminary
experiments show promising results. However, the element-wise learnable
weight mask not only induces a non-trivial additional parameter overhead but
also increases the optimization complexity. To this end, this work proposes a
novel Gaussian mixture mask (GMM) in which one mask has only two learnable
parameters and which can be conveniently used in any ViT variant whose attention
mechanism allows the use of masks. Experimental results on multiple small
datasets demonstrate the effectiveness of our proposed Gaussian mask for
boosting ViTs for free (almost zero additional parameter or computation cost).
Our code will be publicly available at
https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention.

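This work proposes a novel Gaussian mixture mask (GMM) that boosts Vision Transformer (ViT) performance on small datasets without pre-training by improving local modeling; experiments show that it improves ViT markedly at almost no additional parameter or computation cost.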
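
The abstract does not spell out the mask's functional form or how it is combined with attention, so the following is only a minimal sketch under stated assumptions: each Gaussian component has an amplitude and a width (taken here as the "two learnable parameters"), the mask depends on the 2D distance between patch positions, and it is multiplied element-wise into an already normalized attention matrix whose rows are then renormalized. The function names `gaussian_mixture_mask` and `apply_mask` are hypothetical and are not taken from the authors' code.

```python
import math

def gaussian_mixture_mask(grid, amplitudes, sigmas):
    """Build an N x N mask for N = grid * grid patches.

    Assumed form: mask[i][j] = sum_k a_k * exp(-d(i, j)^2 / (2 * sigma_k^2)),
    where d is the Euclidean distance between patch grid coordinates.
    """
    coords = [(r, c) for r in range(grid) for c in range(grid)]
    mask = []
    for ri, ci in coords:
        row = []
        for rj, cj in coords:
            d2 = (ri - rj) ** 2 + (ci - cj) ** 2
            row.append(sum(a * math.exp(-d2 / (2.0 * s * s))
                           for a, s in zip(amplitudes, sigmas)))
        mask.append(row)
    return mask

def apply_mask(attn, mask):
    """Reweight attention element-wise, then renormalize each row.

    Assumption: attn is row-stochastic (post-softmax); the paper may
    combine mask and attention differently.
    """
    out = []
    for a_row, m_row in zip(attn, mask):
        weighted = [a * m for a, m in zip(a_row, m_row)]
        total = sum(weighted)
        out.append([w / total for w in weighted])
    return out
```

For a 2 x 2 patch grid (N = 4), an element-wise mask would need N * N = 16 parameters, while the single-component mask above needs 2 regardless of N. Applying it to uniform attention shifts weight toward spatially close patches, which is the local bias the abstract describes.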

CNN or ViT? Revisiting Vision Transformers Through the Lens of Convolution

We propose a novel deep network structure called "Network In Network" (NIN)
to enhance model discriminability for local patches within the receptive field.
The conventional convolutional layer uses linear filters followed by a
nonlinear activation function to scan the input. Instead, we build micro neural
networks with more complex structures to abstract the data within the receptive
field. We instantiate the micro neural network with a multilayer perceptron,
which is a potent function approximator. The feature maps are obtained by
sliding the micro networks over the input in a similar manner as CNN; they are
then fed into the next layer. Deep NIN can be implemented by stacking multiple
instances of the structure described above. With enhanced local modeling via the micro
network, we are able to utilize global average pooling over feature maps in the
classification layer, which is easier to interpret and less prone to
overfitting than traditional fully connected layers. We demonstrate
state-of-the-art classification performance with NIN on CIFAR-10 and
CIFAR-100, and reasonable performance on SVHN and MNIST.

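This paper proposes a new deep neural network structure called "Network In Network" (NIN) to enhance model discriminability for local patches within the receptive field. By building micro neural networks within the receptive field, it demonstrates strong image classification performance for NIN on multiple datasets, as well as the advantage of replacing traditional fully connected layers with global average pooling to reduce overfitting.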
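
The two ideas in the NIN abstract, sliding a micro multilayer perceptron over the input and replacing fully connected layers with global average pooling, can be illustrated with a small pure-Python sketch. It assumes a single-channel input, stride 1, no padding, and ReLU activations; the function names (`mlpconv`, `global_average_pool`) and the layer format (a list of `(weights, biases)` pairs) are hypothetical choices for this illustration, not the authors' implementation.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(weights, biases, v):
    # weights has shape (out, in); biases has length out.
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]

def mlpconv(image, layers, k):
    """Slide a micro MLP over every k x k patch of a 2D image.

    Returns feature maps indexed as [channel][row][col], one channel
    per output unit of the last MLP layer.
    """
    height, width = len(image), len(image[0])
    out_h, out_w = height - k + 1, width - k + 1
    n_out = len(layers[-1][1])
    maps = [[[0.0] * out_w for _ in range(out_h)] for _ in range(n_out)]
    for r in range(out_h):
        for c in range(out_w):
            # Flatten the k x k patch, then run it through the shared MLP.
            v = [image[r + i][c + j] for i in range(k) for j in range(k)]
            for weights, biases in layers:
                v = relu(dense(weights, biases, v))
            for ch, val in enumerate(v):
                maps[ch][r][c] = val
    return maps

def global_average_pool(maps):
    # One scalar per feature map: the mean over all spatial positions.
    return [sum(sum(row) for row in m) / (len(m) * len(m[0])) for m in maps]
```

In a classifier built this way, the last mlpconv layer would emit one feature map per class, and the pooled vector would be passed directly to softmax, so no fully connected layer with its own parameters is needed.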