In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data, where each token in the Markov chain statistically depends on the previous $n$ tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a $\mathit{copier}$, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a $\mathit{selector}$ that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a $\mathit{classifier}$ that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by experiments.

本研究针对大型语言模型（LLM）中上下文学习的理论基础缺乏明确解释的问题，探讨了变换器其他构件如何促进上下文学习。通过分析一个包含相对位置嵌入、多头softmax注意力和归一化前馈层的复杂变换器模型，我们证明了在交叉熵损失下的梯度流收敛到一个包含引导头机制的极限模型，从而揭示了训练动态的贡献及其潜在影响。

揭示引导头：可证明的训练动态与变换器中的特征学习