We study the dynamics of gradient flow for training a multi-head softmax
attention model for in-context learning (ICL) of multi-task linear regression. We
establish the global convergence of gradient flow under suitable choices of
initialization. In addition, we prove that an interesting "task allocation"
phenomenon emerges during the gradient flow dynamics, where each attention head
focuses on solving a single task of the multi-task model. Specifically, we
prove that the gradient flow dynamics can be split into three phases -- a
warm-up phase where the loss decreases rather slowly and the attention heads
gradually build up their inclination towards individual tasks, an emergence
phase where each head selects a single task and the loss rapidly decreases, and
a convergence phase where the attention parameters converge to a limit.
Furthermore, we prove the optimality of gradient flow in the sense that the
limiting model learned by gradient flow is on par with the best possible
multi-head softmax attention model up to a constant factor. Our analysis also
delineates a strict separation in terms of the prediction accuracy of ICL
between single-head and multi-head attention models. The key technique for our
convergence analysis is to map the gradient flow dynamics in the parameter
space to a set of ordinary differential equations in the spectral domain, where
the relative magnitudes of the semi-singular values of the attention weights
determine task allocation. To the best of our knowledge, our work provides the
first convergence result for the multi-head softmax attention model.
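
To make the setting concrete, the following is a minimal sketch of a multi-head
softmax attention predictor applied to an in-context prompt for multi-task linear
regression. It is an illustration under stated assumptions, not the
parameterization analyzed in the paper: the names sample_prompt and
multi_head_attention_predict, the per-head matrices W (query-key product) and O
(output map), and the choice of treating each output coordinate as one task are
all hypothetical. The heads in the usage example are set by hand to mimic "task
allocation"; they are not the result of gradient flow.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def sample_prompt(G, num_examples, noise_std, rng):
    """Draw one prompt: examples (x_i, y_i) with y_i = G^T x_i + noise, plus a query.

    G has shape (d, num_tasks); each column is the regression vector of one task
    (an assumption made for this sketch).
    """
    d, num_tasks = G.shape
    X = rng.standard_normal((num_examples, d))
    Y = X @ G + noise_std * rng.standard_normal((num_examples, num_tasks))
    x_query = rng.standard_normal(d)
    return X, Y, x_query


def multi_head_attention_predict(X, Y, x_query, W_list, O_list):
    """Sum over heads of O_h @ (softmax-weighted average of the in-context labels).

    Head h scores example i by x_i^T W_h x_query and normalizes with softmax.
    """
    pred = np.zeros(Y.shape[1])
    for W, O in zip(W_list, O_list):
        scores = X @ W @ x_query          # shape (num_examples,)
        weights = softmax(scores)         # nonnegative, sums to 1
        pred += O @ (Y.T @ weights)       # shape (num_tasks,)
    return pred


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((4, 2))       # 4 covariates, 2 tasks
    X, Y, x_query = sample_prompt(G, num_examples=16, noise_std=0.1, rng=rng)
    # Hand-set "allocated" heads: head h only writes to output coordinate h.
    W_list = [0.5 * np.eye(4), 0.5 * np.eye(4)]
    O_list = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
    print(multi_head_attention_predict(X, Y, x_query, W_list, O_list))
```

With a single head, W set to zero, and O set to the identity, the softmax
weights are uniform and the prediction reduces to the mean of the in-context
labels, which is a convenient sanity check for the sketch.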

Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality

Meta-learning, or learning-to-learn, seeks to design algorithms that can
utilize previous experience to rapidly learn new skills or adapt to new
environments. Representation learning -- a key tool for performing
meta-learning -- learns a data representation that can transfer knowledge
across multiple tasks, which is essential in regimes where data is scarce.
Despite a recent surge of interest in the practice of meta-learning, the
theoretical underpinnings of meta-learning algorithms are lacking, especially
in the context of learning transferable representations. In this paper, we
focus on the problem of multi-task linear regression -- in which multiple
linear regression models share a common, low-dimensional linear representation.
Here, we provide provably fast, sample-efficient algorithms to address the dual
challenges of (1) learning a common set of features from multiple, related
tasks, and (2) transferring this knowledge to new, unseen tasks. Both are
central to the general problem of meta-learning. Finally, we complement these
results by providing information-theoretic lower bounds on the sample
complexity of learning these linear features.

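This paper proposes a meta-learning-based algorithm for multi-task linear regression that quickly learns multiple related tasks through a shared low-dimensional linear representation and transfers this knowledge to new, unseen tasks; it also provides information-theoretic lower bounds that demonstrate the algorithm's efficiency.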
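
The shared-representation model described above can be illustrated with a small
simulation. The sketch below assumes that task t has regression vector
B @ alpha_t for a common d-by-r matrix B with orthonormal columns, and that the
covariates are standard Gaussian. It recovers the column space of B with a
simple second-moment estimator (the top-r eigenvectors of the average of
y^2 x x^T) and then fits a new task by least squares in the learned
r-dimensional feature space. This is one standard estimator for this kind of
model, used here for illustration; it is not claimed to be the exact algorithm
of the paper, and all function names are hypothetical.

```python
import numpy as np


def simulate_tasks(d, r, num_tasks, n_per_task, noise_std, rng):
    """Generate pooled data from tasks that share a rank-r linear representation B."""
    B, _ = np.linalg.qr(rng.standard_normal((d, r)))        # orthonormal columns
    alphas = rng.standard_normal((num_tasks, r))
    alphas /= np.linalg.norm(alphas, axis=1, keepdims=True)  # unit-norm task vectors
    betas = alphas @ B.T                                     # (num_tasks, d)
    X = rng.standard_normal((num_tasks, n_per_task, d))
    noise = noise_std * rng.standard_normal((num_tasks, n_per_task))
    y = np.einsum("tnd,td->tn", X, betas) + noise
    return B, X.reshape(-1, d), y.reshape(-1)


def estimate_subspace(X, y, r):
    """Top-r eigenvectors of (1/N) * sum_i y_i^2 x_i x_i^T (assumes Gaussian x)."""
    M = (X * (y ** 2)[:, None]).T @ X / len(y)
    _, eigvecs = np.linalg.eigh(M)       # eigenvalues returned in ascending order
    return eigvecs[:, -r:]


def subspace_distance(B_hat, B):
    """Spectral norm of the part of B that lies outside span(B_hat)."""
    return np.linalg.norm(B - B_hat @ (B_hat.T @ B), 2)


def fit_new_task(B_hat, X_new, y_new):
    """Transfer step: least squares on the r learned features, mapped back to R^d."""
    alpha_hat, *_ = np.linalg.lstsq(X_new @ B_hat, y_new, rcond=None)
    return B_hat @ alpha_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, X, y = simulate_tasks(d=10, r=2, num_tasks=100, n_per_task=500,
                             noise_std=0.1, rng=rng)
    B_hat = estimate_subspace(X, y, r=2)
    print("subspace distance:", subspace_distance(B_hat, B))
```

The transfer step only estimates r coefficients for a new task instead of d,
which is why a shared representation helps when the new task has few samples.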