Masked image modeling (MIM), which predicts randomly masked patches from
unmasked ones, has emerged as a promising approach in self-supervised vision
pretraining. However, the theoretical understanding of MIM remains limited,
especially for the foundational transformer architecture. To the best of our
knowledge, this paper provides the first end-to-end theory of learning
one-layer transformers with softmax attention in MIM self-supervised
pretraining. On the conceptual side, we posit a theoretical mechanism of how
transformers, pretrained with MIM, produce empirically observed local and
diverse attention patterns on data distributions with spatial structures that
highlight feature-position correlations. On the technical side, our end-to-end
analysis of the training dynamics of softmax-based transformers accommodates
both input and position embeddings simultaneously. It is developed from a
novel approach that tracks the interplay between the attention arising from
feature-position correlations and that arising from position-wise correlations.
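
As a rough illustration of the setting described above, the following NumPy
sketch computes a MIM reconstruction loss for a one-layer softmax attention
model that uses both input and position embeddings. The specific choices here
are assumptions made for illustration and are not taken from the paper: a
single merged query-key matrix `Q`, an identity value map, a zero vector as the
mask token, a squared-error loss on masked patches only, and all function and
variable names.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def mim_loss(X, E, Q, masked):
    """Reconstruction loss of a one-layer softmax attention model (sketch).

    X      : (P, d) patch features
    E      : (P, d) position embeddings
    Q      : (d, d) merged query-key matrix (assumed parameterization)
    masked : (P,) boolean array, True where a patch is masked
    Requires at least one masked and one unmasked patch.
    """
    # Assumption: masked patches are replaced by a zero mask token.
    X_in = np.where(masked[:, None], 0.0, X)
    H = X_in + E                      # input embedding plus position embedding
    scores = H @ Q @ H.T              # (P, P) attention logits
    # Assumption: every patch attends only to unmasked patches.
    scores = np.where(masked[None, :], -np.inf, scores)
    attn = softmax(scores, axis=-1)
    X_hat = attn @ X_in               # identity value map (assumption)
    err = X_hat[masked] - X[masked]   # error on masked patches only
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=-1))
    return loss, attn


rng = np.random.default_rng(0)
P, d = 16, 8
X = rng.normal(size=(P, d))
E = rng.normal(size=(P, d))
Q = np.zeros((d, d))                  # zero init: uniform attention over unmasked patches
masked = np.zeros(P, dtype=bool)
masked[rng.choice(P, size=P // 4, replace=False)] = True
loss, attn = mim_loss(X, E, Q, masked)
```

With `Q` set to zero, each row of `attn` is uniform over the unmasked patches.
Training `Q` is what would allow the attention to depend on the
feature-position and position-wise correlations that the analysis tracks.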

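This paper provides the first end-to-end theory of one-layer transformers with softmax attention in MIM self-supervised pretraining. It aims to explain the theoretical mechanism by which transformers produce local and diverse attention patterns on data distributions that highlight feature-position and position-wise correlations, and it analyzes the training dynamics while accounting for input and position embeddings simultaneously.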