The success of the Vision Transformer (ViT) has been reported across a wide
range of image recognition tasks. The advantage of ViT over CNNs has been largely
attributed to large training datasets or auxiliary pre-training. Without
pre-training, ViT performs poorly on small datasets because global
self-attention has limited capacity for local modeling. To boost ViT on small
datasets without pre-training, this work improves its local modeling by
applying a weight mask to the original self-attention matrix. A
straightforward way to locally adapt the self-attention matrix is an
element-wise learnable weight mask (ELM), which shows promising results in our
preliminary experiments. However, the ELM not only adds a non-trivial
parameter overhead but also increases the optimization complexity. To this
end, this work proposes a
novel Gaussian mixture mask (GMM), in which each mask has only two learnable
parameters and which can be conveniently used in any ViT variant whose attention
mechanism allows the use of masks. Experimental results on multiple small
datasets demonstrate the effectiveness of the proposed Gaussian mask for
boosting ViTs for free (almost zero additional parameter or computation cost).
Our code will be publicly available at
https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention
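
As a rough illustration of the idea in the abstract above, the following Python sketch builds a Gaussian mixture mask over a patch grid and applies it to an attention score matrix. The abstract only states that each mask has two learnable parameters. Treating them as an amplitude and a standard deviation, measuring squared Euclidean distance between patch positions, and multiplying the scores before the softmax are all assumptions made here. The function names are hypothetical and are not taken from the authors' repository.

```python
import math


def gaussian_mixture_mask(grid_h, grid_w, amplitudes, sigmas):
    """Build an N x N mask (N = grid_h * grid_w) as a sum of Gaussian kernels.

    Assumption: each kernel has two parameters, an amplitude and a standard
    deviation, and depends only on the distance between two patch positions.
    """
    # Row-major (row, col) position of every patch in the grid.
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    mask = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            dist_sq = (r1 - r2) ** 2 + (c1 - c2) ** 2
            # Sum the contribution of every Gaussian kernel in the mixture.
            value = sum(
                a * math.exp(-dist_sq / (2.0 * s * s))
                for a, s in zip(amplitudes, sigmas)
            )
            row.append(value)
        mask.append(row)
    return mask


def masked_attention(scores, mask):
    """Multiply attention scores by the mask, then apply a row-wise softmax.

    Assumption: the mask is applied multiplicatively before the softmax.
    """
    attention = []
    for score_row, mask_row in zip(scores, mask):
        weighted = [s * m for s, m in zip(score_row, mask_row)]
        peak = max(weighted)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in weighted]
        total = sum(exps)
        attention.append([e / total for e in exps])
    return attention
```

With uniform scores, nearby patches receive more attention than distant ones, which is the local bias the mask is meant to introduce. In a trainable model the amplitudes and standard deviations would be learnable parameters instead of fixed numbers.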

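This work proposes a novel Gaussian mixture mask (GMM) that boosts the performance of the Vision Transformer (ViT) on small datasets without pre-training by improving its local modeling. Experiments show that the method markedly improves ViT at almost no additional parameter or computation cost.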

CNN or ViT? Revisiting Vision Transformers Through the Lens of Convolution

We propose a straightforward method that simultaneously reconstructs the 3D
facial structure and provides dense alignment. To achieve this, we design a 2D
representation called the UV position map, which records the 3D shape of a complete
face in UV space, and then train a simple Convolutional Neural Network to regress
it from a single 2D image. We also integrate a weight mask into the loss
function during training to improve the performance of the network. Our method
does not rely on any prior face model, and can reconstruct full facial geometry
along with semantic meaning. Meanwhile, our network is very lightweight and
takes only 9.8 ms to process an image, which is much faster than previous
works. Experiments on multiple challenging datasets show that our method
surpasses other state-of-the-art methods on both reconstruction and alignment
tasks by a large margin.
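
The weight mask in the loss can be pictured as a per-pixel weighted squared error over the position map. The sketch below assumes the loss is the mean of weighted squared errors over all pixels. The abstract gives neither the exact loss nor the weight values, so the function name and any weights passed to it are hypothetical.

```python
def weighted_position_map_loss(pred, target, weight):
    """Mean of per-pixel squared errors, each scaled by a weight.

    pred, target: H x W grids whose entries are (x, y, z) tuples.
    weight:       H x W grid of non-negative per-pixel weights.
    Assumption: the loss is a plain weighted mean squared error.
    """
    total = 0.0
    count = 0
    for pred_row, target_row, weight_row in zip(pred, target, weight):
        for p, t, w in zip(pred_row, target_row, weight_row):
            # Squared Euclidean distance between predicted and true 3D points.
            sq_err = sum((pc - tc) ** 2 for pc, tc in zip(p, t))
            total += w * sq_err
            count += 1
    return total / count
```

Giving some regions of the position map a larger weight makes errors there count more during training, while a weight of zero removes a region from the loss entirely.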

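This paper proposes a simple method that simultaneously reconstructs the 3D facial structure and provides dense alignment. It uses a 2D representation called the UV position map to record the 3D shape of a complete face, and trains a simple convolutional neural network to regress it from a single 2D image. The method does not rely on any prior face model and can reconstruct full facial geometry. The network is very lightweight compared with previous works and takes only 9.8 ms to process an image. Experiments on multiple challenging datasets show that the method outperforms other state-of-the-art methods on both reconstruction and alignment tasks.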