In-context learning (ICL) is now a common method for supervising large language models (LLMs): given labeled examples in the input context, the LLM learns to perform the task without weight updates. Despite ICL's prevalence and utility, we understand little about whether models supervised in this manner represent the underlying structure of their tasks, rather than superficial heuristics that only generalize to identically distributed examples. In this study, we investigate the robustness of LLMs supervised via ICL using the test case of sensitivity to syntax, which is a prerequisite for robust language understanding. Our experiments are based on two simple and well-controlled syntactic transformations tasks, where correct out-of-distribution generalization requires an accurate syntactic analysis of the input. We further investigate whether out-of-distribution generalization can be improved via chain-of-thought prompting, where the model is provided with a sequence of intermediate computation steps that illustrate how the task ought to be performed. In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LMs on this fundamental linguistic phenomenon, and that the variance is explained more by the composition of the pre-training corpus and supervision methods than by model size. In particular, we find evidence that models pre-trained on code generalize better, and benefit to a greater extent from chain-of-thought prompting.

在本研究中，我们通过对语法敏感性的测试案例来研究通过上下文学习监督的大型语言模型的鲁棒性，并调查模型的预训练语料库组成和监督方法对模型变异性的影响。我们发现，相较于模型大小，模型在这一基本语言现象上的变异性更多地受到预训练语料库组成和监督方法的影响。同时，我们还发现，在代码上进行预训练的模型更好地推广，并在更大程度上受到思维链提示的益处。

背景下的学习表现具有普适性，但并非始终稳定：以语法为例