In-context learning (ICL) of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples. In this paper, we study the efficacy of ICL from the viewpoint of statistical learning theory. We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer, pretrained on nonparametric regression tasks sampled from general function spaces including the Besov space and the piecewise $\gamma$-smooth class. We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context by encoding the most relevant basis representations during pretraining. Our analysis extends to high-dimensional or sequential data and distinguishes the \emph{pretraining} and \emph{in-context} generalization gaps. Furthermore, we establish information-theoretic lower bounds for meta-learners with respect to both the number of tasks and the number of in-context examples. These findings shed light on the roles of task diversity and representation learning for ICL.
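
As a rough illustration of the architecture described above (a feature map followed by one linear attention layer), the sketch below predicts a query label from in-context examples. It is not the paper's construction: the fixed cosine basis `features` stands in for the pretrained deep neural network, and `gamma`, `linear_attention_predict`, and all sizes are hypothetical names and values chosen for this example.

```python
import numpy as np

def features(x, n_basis=8):
    """Hypothetical stand-in for the pretrained DNN: a cosine basis on [0, 1]."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    k = np.arange(n_basis)
    phi = np.sqrt(2.0) * np.cos(np.pi * k * x)  # shape (len(x), n_basis)
    phi[:, 0] = 1.0                             # constant basis function
    return phi

def linear_attention_predict(x_ctx, y_ctx, x_query, gamma):
    """One linear attention layer (no softmax) applied to the features.

    Assumed form: prediction = phi(x_query)^T @ gamma @ (mean_i y_i * phi(x_i)).
    """
    phi_ctx = features(x_ctx, gamma.shape[0])
    phi_q = features(x_query, gamma.shape[0])
    coeffs = gamma @ (phi_ctx.T @ np.asarray(y_ctx, dtype=float)) / len(x_ctx)
    return phi_q @ coeffs

# Usage: noisy in-context examples from one regression task.
rng = np.random.default_rng(0)
beta = rng.normal(size=8) / (1.0 + np.arange(8))    # task-specific coefficients
x_ctx = rng.uniform(size=256)
y_ctx = features(x_ctx) @ beta + 0.1 * rng.normal(size=256)
x_query = np.linspace(0.0, 1.0, 5)
pred = linear_attention_predict(x_ctx, y_ctx, x_query, np.eye(8))
print("mean squared error:", np.mean((pred - features(x_query) @ beta) ** 2))
```

With `gamma` set to the identity and an orthonormal basis, this reduces to a classical projection estimator whose coefficients are empirical averages over the context. In the setting of the abstract, both the feature map and the attention matrix would instead be learned during pretraining across many tasks.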

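This paper studies the effectiveness of in-context learning (ICL) in large language models from the viewpoint of statistical learning theory and derives approximation and generalization error bounds for transformers on nonparametric regression tasks. The results show that sufficiently trained transformers can not only achieve the minimax optimal estimation risk but also improve upon it in context through the representations learned during pretraining, which highlights the key roles of task diversity and representation learning in ICL.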

Transformers are Minimax Optimal Nonparametric In-Context Learners