Transformers have a remarkable ability to learn and execute tasks based on
examples provided within the input itself, without explicit prior training. It
has been argued that this capability, known as in-context learning (ICL), is a
cornerstone of Transformers' success, yet questions about the necessary sample
complexity, pretraining task diversity, and context length for successful ICL
remain unresolved. Here, we provide a precise answer to these questions in an
exactly solvable model of ICL of a linear regression task by linear attention.
We derive sharp asymptotics for the learning curve in a phenomenologically rich
scaling regime where the token dimension is taken to infinity; the context
length and pretraining task diversity scale proportionally with the token
dimension; and the number of pretraining examples scales quadratically. We
demonstrate a double-descent learning curve with increasing pretraining
examples, and uncover a phase transition in the model's behavior between low
and high task diversity regimes: In the low diversity regime, the model tends
toward memorization of training tasks, whereas in the high diversity regime, it
achieves genuine in-context learning and generalization beyond the scope of
pretrained tasks. These theoretical insights are empirically validated through
experiments with both linear attention and full nonlinear Transformer
architectures.
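
As a rough illustration of the setting described above, the following NumPy sketch generates in-context linear regression data from a finite pool of pretraining tasks and fits a simplified linear-attention predictor by ridge regression. It is a minimal sketch under stated assumptions, not the construction analyzed in this work: the bilinear predictor form, the Gaussian data model, the ridge penalty, and the names `alpha`, `kappa`, `tau`, `sample_contexts`, `attention_features`, `fit_gamma`, and `icl_error` are all hypothetical choices made for illustration. The only elements taken from the text are the scalings: context length and task diversity proportional to the token dimension, and the number of pretraining examples proportional to its square.

```python
import numpy as np

def sample_contexts(rng, n, d, ell, task_pool, noise_std=0.1):
    """Draw n contexts; each uses one task vector picked at random from task_pool (k x d)."""
    idx = rng.integers(0, task_pool.shape[0], size=n)
    w = task_pool[idx]                                      # (n, d) task vector per context
    x = rng.standard_normal((n, ell + 1, d)) / np.sqrt(d)   # ell examples plus one query
    y = np.einsum("nld,nd->nl", x, w) + noise_std * rng.standard_normal((n, ell + 1))
    return x, y

def attention_features(x, y):
    """Flattened outer product of the query with h = (d/ell) * sum_i y_i x_i.

    Assumed predictor form: y_hat = x_query^T Gamma h, which is linear in Gamma.
    """
    n, ell_plus_1, d = x.shape
    ell = ell_plus_1 - 1
    h = (d / ell) * np.einsum("nl,nld->nd", y[:, :ell], x[:, :ell])
    x_query = x[:, ell]
    feats = np.einsum("ni,nj->nij", x_query, h).reshape(n, d * d)
    return feats, y[:, ell]

def fit_gamma(feats, targets, ridge=1e-3):
    """Ridge regression for the d*d entries of Gamma (assumed training procedure)."""
    p = feats.shape[1]
    gram = feats.T @ feats + ridge * np.eye(p)
    return np.linalg.solve(gram, feats.T @ targets)

def icl_error(gamma, x, y):
    """Mean squared error of the query prediction."""
    feats, targets = attention_features(x, y)
    return float(np.mean((feats @ gamma - targets) ** 2))

rng = np.random.default_rng(0)
d = 8                                  # token dimension (small, for illustration only)
alpha, kappa, tau = 2.0, 4.0, 4.0      # hypothetical ratios: ell/d, k/d, n/d^2
ell, k, n = int(alpha * d), int(kappa * d), int(tau * d * d)

task_pool = rng.standard_normal((k, d))            # finite set of pretraining tasks
x_train, y_train = sample_contexts(rng, n, d, ell, task_pool)
gamma = fit_gamma(*attention_features(x_train, y_train))

fresh_tasks = rng.standard_normal((500, d))        # tasks never seen in pretraining
x_test, y_test = sample_contexts(rng, 500, d, ell, fresh_tasks)
print("pretraining-set error:", icl_error(gamma, x_train, y_train))
print("fresh-task error:", icl_error(gamma, x_test, y_test))
```

Because the assumed predictor has d² free parameters, the number of pretraining examples must grow quadratically with the token dimension for the ratio of examples to parameters to stay fixed. Varying `kappa` and `tau` in a sketch like this is one way to probe the gap between error on the pretraining tasks and error on fresh tasks, though a toy run at small `d` should not be expected to reproduce the asymptotic curves.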

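Summary: The ability of Transformers to learn and execute tasks from examples given in the input, without explicit prior training, is known as in-context learning (ICL) and is regarded as a foundation of their success. This work gives precise answers on the sample complexity, pretraining task diversity, and context length required for successful ICL, deriving sharp asymptotics for the learning curve in an exactly solvable model of ICL of a linear regression task by linear attention. The analysis shows a double-descent learning curve as the number of pretraining examples increases, and a phase transition in the model's behavior between low and high task diversity: in the low-diversity regime the model tends toward memorizing the training tasks, whereas in the high-diversity regime it achieves genuine in-context learning and generalizes beyond the pretraining tasks. These theoretical insights are validated empirically through experiments with both linear attention and full nonlinear Transformer architectures.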