Transformer models exhibit in-context learning: the ability to accurately
predict the response to a novel query based on illustrative examples in the
input sequence. In-context learning contrasts with traditional in-weights
learning of query-output relationships. What aspects of the training data
distribution and architecture favor in-context vs in-weights learning? Recent
work has shown that specific distributional properties inherent in language,
such as burstiness, large dictionaries and skewed rank-frequency distributions,
control the trade-off or simultaneous appearance of these two forms of
learning. We first show that these results are recapitulated in a minimal
attention-only network trained on a simplified dataset. In-context learning
(ICL) is driven by the abrupt emergence of an induction head, which
subsequently competes with in-weights learning. By identifying progress
measures that precede in-context learning and by running targeted experiments, we
construct a two-parameter model of an induction head which emulates the full
data distributional dependencies displayed by the attention-based network. A
phenomenological model of induction head formation traces its abrupt emergence
to the sequential learning of three nested logits enabled by an intrinsic
curriculum. We propose that the sharp transitions in attention-based networks
arise due to a specific chain of multi-layer operations necessary to achieve
ICL, which is implemented by nested nonlinearities sequentially learned during
training.

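Transformer models exhibit in-context learning: accurately predicting the response to a novel query from examples in the input sequence. The study asks which aspects of the training data distribution and architecture favor in-context learning versus traditional in-weights learning of query-output relationships. It also presents a minimal attention-only network trained on a simplified dataset and shows that in-context learning is driven by the abrupt emergence of an induction head. The study proposes that the sharp transitions in attention-based networks arise from the specific chain of multi-layer operations required to achieve in-context learning.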

The mechanistic basis of data dependence and abrupt learning in an in-context classification task
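
To make the in-context classification setting concrete, the following is a minimal sketch of what an induction-head-like computation does at inference time: match the query against the items in the context, then copy the label paired with the best match. It is an illustration under stated assumptions, not the network or the two-parameter model from the abstract. The attention here is a hand-written dot-product similarity rather than anything learned, and the names `in_context_predict` and `beta` (an attention-sharpness parameter) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_context_predict(context, query, num_labels, beta=10.0):
    """Return label probabilities for `query` given (item, label) examples.

    Assumption: attention weights come from a fixed dot-product similarity
    scaled by `beta`; a trained network would have to learn this matching.
    """
    weights = softmax([beta * dot(item, query) for item, _ in context])
    probs = [0.0] * num_labels
    for w, (_, label) in zip(weights, context):
        probs[label] += w  # copy the label of each attended example
    return probs


# Three illustrative examples in the context, then a query that repeats item 2.
context = [
    ([1.0, 0.0, 0.0], 2),
    ([0.0, 1.0, 0.0], 0),
    ([0.0, 0.0, 1.0], 1),
]
query = [0.0, 1.0, 0.0]
probs = in_context_predict(context, query, num_labels=3)
predicted = max(range(3), key=lambda k: probs[k])
```

Because the label is read from the context rather than from stored weights, relabeling the examples changes the prediction without any retraining, which is the property that distinguishes in-context from in-weights learning.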

Large language models based on transformers have achieved great empirical
successes. However, as they are deployed more widely, there is a growing need
to better understand their internal mechanisms in order to make them more
reliable. These models appear to store vast amounts of knowledge from their
training data, and to adapt quickly to new information provided in their
context or prompt. We study how transformers balance these two types of
knowledge by considering a synthetic setup where tokens are generated from
either global or context-specific bigram distributions. By a careful empirical
analysis of the training process on a simplified two-layer transformer, we
illustrate the fast learning of global bigrams and the slower development of an
"induction head" mechanism for the in-context bigrams. We highlight the role of
weight matrices as associative memories, provide theoretical insights on how
gradients enable their learning during training, and study the role of
data-distributional properties.

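This study uses a synthetic setup to examine how transformers trade off global information against context-specific information. It finds that the models learn global information relatively quickly but are slower to recognize bigrams that appear in the context. It also examines the role of weight matrices as associative memories, the theoretical mechanism by which gradients enable their learning during training, and the role of data-distributional properties.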
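
The idea of a weight matrix acting as an associative memory can be sketched with a standard construction: store each (input token, output token) pair as an outer product of their embeddings, and recall by scoring every output embedding against the matrix applied to an input embedding. This is a toy illustration that assumes random Gaussian embeddings, which are nearly orthogonal in high dimension. It is not the training procedure analyzed in the abstract, and `build_memory`, `recall`, and the example bigram table are hypothetical.

```python
import math
import random


def random_embedding(dim, rng):
    # Gaussian entries with variance 1/dim give a vector of roughly unit norm.
    return [rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_memory(pairs, in_emb, out_emb, dim):
    """W = sum over stored pairs of outer(out_emb[dst], in_emb[src])."""
    W = [[0.0] * dim for _ in range(dim)]
    for src, dst in pairs:
        u_in, u_out = in_emb[src], out_emb[dst]
        for r in range(dim):
            row, coeff = W[r], u_out[r]
            for c in range(dim):
                row[c] += coeff * u_in[c]
    return W


def recall(W, in_emb, out_emb, src):
    """Return the output token whose embedding best matches W @ in_emb[src]."""
    read = [dot(row, in_emb[src]) for row in W]
    scores = [dot(u_out, read) for u_out in out_emb]
    return max(range(len(scores)), key=lambda j: scores[j])


rng = random.Random(0)
dim, vocab = 256, 5
in_emb = [random_embedding(dim, rng) for _ in range(vocab)]
out_emb = [random_embedding(dim, rng) for _ in range(vocab)]

# Hypothetical bigram table: each token maps to one successor.
bigrams = [(0, 3), (1, 4), (2, 0), (3, 1), (4, 2)]
W = build_memory(bigrams, in_emb, out_emb, dim)
recalled = [recall(W, in_emb, out_emb, src) for src, _ in bigrams]
```

Recall works here because cross terms between different embeddings are small relative to the matched term. As more pairs are stored at a fixed dimension, that interference grows and retrieval becomes less reliable.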