Attention-based models, such as the Transformer, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in the sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input.
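As a concrete illustration of the task definition, here is a minimal NumPy sketch of a toy data generator consistent with the description above. The specific directions `k_star` (marking the latent position) and `v_star` (producing the regression target), the Gaussian token distribution, and the sign-based encoding of the relevant position are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def sample_single_location_batch(n, L, d, rng=None):
    """Toy generator for single-location regression (illustrative assumptions).

    Each sequence has L tokens in R^d. A latent position j0 (uniform over
    positions) carries the signal: only the token at j0 determines the
    target, and j0 is recoverable via a linear projection onto k_star.
    """
    rng = np.random.default_rng(rng)
    k_star = np.zeros(d); k_star[0] = 1.0   # hypothetical "position" direction
    v_star = np.zeros(d); v_star[1] = 1.0   # hypothetical "value" direction
    X = rng.standard_normal((n, L, d))
    # Push every token to the negative side of k_star, then flip the relevant
    # one, so that sign(<x_j, k_star>) identifies the latent position j0.
    X[..., 0] = -np.abs(X[..., 0])
    j0 = rng.integers(L, size=n)
    rows = np.arange(n)
    X[rows, j0, 0] = np.abs(X[rows, j0, 0])
    # The target depends only on the token at the latent position j0.
    y = X[rows, j0] @ v_star
    return X, y, j0

X, y, j0 = sample_single_location_batch(n=4, L=8, d=16, rng=0)
print(X.shape, y.shape, j0)  # (4, 8, 16) (4,) plus the latent positions
```

The point of the sketch is the token-wise sparsity of the task: the label `y` ignores all but one token, while the position of that token varies from sample to sample and is only recoverable through a linear projection of the input.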