The success of Vision Transformer (ViT) has been reported across a wide
range of image recognition tasks. The merit of ViT over CNN has been largely
attributed to large training datasets or auxiliary pre-training. Without
pre-training, the performance of ViT on small datasets suffers because
global self-attention has limited capacity for local modeling. Towards boosting
ViT on small datasets without pre-training, this work improves its local
modeling by applying a weight mask on the original self-attention matrix. A
straightforward way to locally adapt the self-attention matrix can be realized
by an element-wise learnable weight mask (ELM), which shows promise in our
preliminary experiments. However, the simple element-wise learnable
weight mask not only induces a non-trivial additional parameter overhead but
also increases the optimization complexity. To this end, this work proposes a
novel Gaussian mixture mask (GMM) in which each mask has only two learnable
parameters and which can be conveniently used in any ViT variant whose attention
mechanism allows the use of masks. Experimental results on multiple small
datasets demonstrate the effectiveness of our proposed Gaussian mask in
boosting ViTs for free (almost zero additional parameter or computation cost).
Our code will be publicly available at
\href{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention}{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention}.

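This work proposes a novel Gaussian mixture mask (GMM) that improves local modeling to boost the performance of Vision Transformer (ViT) on small datasets without pre-training. Experiments show that the method significantly improves ViT at almost no additional parameter or computation cost.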
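
As an illustration of the idea of a Gaussian mixture mask applied to a self-attention matrix, the following is a minimal sketch and not the released implementation. It rests on several assumptions that the abstract does not state: that the two parameters of each Gaussian are an amplitude and a standard deviation, that the mask depends on the Euclidean distance between patches on the 2D patch grid, and that the mask multiplies the raw attention scores element-wise before the softmax. The function names `gaussian_mixture_mask` and `masked_attention` are hypothetical, and the parameters are plain floats here, whereas in training they would be learnable.

```python
import math


def gaussian_mixture_mask(grid_h, grid_w, amplitudes, sigmas):
    """Build an N x N mask over N = grid_h * grid_w patches.

    Entry [i][j] is sum_k a_k * exp(-d_ij^2 / (2 * sigma_k^2)), where d_ij is
    the Euclidean distance between patches i and j on the patch grid.
    Assumption: (a_k, sigma_k) are the two parameters of the k-th Gaussian.
    """
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    mask = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            d2 = (r1 - r2) ** 2 + (c1 - c2) ** 2
            row.append(sum(a * math.exp(-d2 / (2.0 * s * s))
                           for a, s in zip(amplitudes, sigmas)))
        mask.append(row)
    return mask


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(scores, mask):
    """Weight raw attention scores element-wise by the mask, then softmax.

    Assumption: the mask is applied multiplicatively before the softmax.
    """
    return [softmax([s * m for s, m in zip(s_row, m_row)])
            for s_row, m_row in zip(scores, mask)]


# Usage: a 2 x 2 patch grid (4 tokens) with a single Gaussian.
mask = gaussian_mixture_mask(2, 2, amplitudes=[1.0], sigmas=[1.0])
uniform_scores = [[1.0] * 4 for _ in range(4)]
attn = masked_attention(uniform_scores, mask)
```

In this sketch, uniform raw scores become biased toward nearby patches after masking, which is the kind of locality the mask is meant to encourage. The mask itself is fully determined by one (amplitude, standard deviation) pair per Gaussian, in contrast to an element-wise mask with one free weight per token pair.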