In-context learning (ICL) refers to a remarkable capability of pretrained large language models: they can learn a new task from a few examples provided at inference time. However, the theoretical understanding of ICL remains largely under-explored, particularly the question of whether transformers can be trained to generalize to unseen examples in a prompt, which requires the model to acquire contextual knowledge of the prompt. This paper investigates the training dynamics of transformers under gradient descent through the lens of non-linear regression tasks. Contextual generalization here can be attained by learning the template function for each task in context, where all template functions lie in a linear space spanned by $m$ basis functions. We analyze the training dynamics of one-layer multi-head transformers that predict unlabeled inputs in context given partially labeled prompts, where the labels contain Gaussian noise and the number of examples in each prompt is not sufficient to determine the template. Under mild assumptions, we show that the training loss of a one-layer multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression over the basis functions. To our knowledge, this study is the first provable demonstration that transformers can learn contextual (i.e., template) information to generalize to both unseen examples and unseen tasks when prompts contain only a small number of query-answer pairs.
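To make the ridge-regression claim concrete, the following is a minimal NumPy sketch of the target behavior, not the paper's transformer or its training procedure. It assumes a hypothetical cosine basis (`basis_features`), a hypothetical helper `ridge_in_context`, and illustrative sizes in which the prompt has fewer labeled examples than basis functions, so the template is underdetermined. The regularization strength is set to the noise variance, which is the Bayes-optimal choice only under the standard-normal prior on coefficients assumed here.

```python
import numpy as np

def basis_features(x, m):
    # Hypothetical basis (an assumption for illustration): cos(k * x), k = 0..m-1.
    return np.stack([np.cos(k * x) for k in range(m)], axis=-1)

def ridge_in_context(phi_labeled, y_labeled, phi_query, lam):
    # Closed-form ridge regression over basis features:
    #   w = (Phi^T Phi + lam * I)^{-1} Phi^T y,  prediction = Phi_query w
    m = phi_labeled.shape[1]
    gram = phi_labeled.T @ phi_labeled + lam * np.eye(m)
    w = np.linalg.solve(gram, phi_labeled.T @ y_labeled)
    return phi_query @ w

rng = np.random.default_rng(0)
m, n_labeled, n_query = 8, 4, 16   # n_labeled < m: template is underdetermined
noise_std = 0.1

# One task: a template function with random coefficients in the basis.
coeffs = rng.normal(size=m)
x = rng.uniform(-np.pi, np.pi, size=n_labeled + n_query)
phi = basis_features(x, m)
y_clean = phi @ coeffs

# Only the first n_labeled tokens of the prompt carry (noisy) labels.
y_noisy = y_clean[:n_labeled] + noise_std * rng.normal(size=n_labeled)

# Predict the unlabeled tokens from the labeled ones.
preds = ridge_in_context(phi[:n_labeled], y_noisy, phi[n_labeled:], lam=noise_std**2)
print("mean squared error on unlabeled tokens:", np.mean((preds - y_clean[n_labeled:]) ** 2))
```

Because the labeled examples do not pin down all $m$ coefficients, the regularizer is what makes the solution unique; this is the sense in which a ridge estimator is a natural target when prompts are short and labels are noisy.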

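This study addresses the lack of theoretical understanding of how pretrained large language models generalize to unseen examples in in-context learning. The authors analyze the training dynamics of transformers through non-linear regression tasks and propose a novel approach in which contextual generalization under few-shot prompts is achieved by learning the template function of each task. The results show that, under certain assumptions, transformers can effectively learn contextual information and thereby generalize to new tasks and examples, which offers a new perspective on the training of machine learning models.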

In-Context Learning with Representations: Contextual Generalization of Trained Transformers