Through pretraining on a corpus with various sources, Large Language Models
(LLMs) have gained impressive performance. However, the impact of each
component of the pretraining corpus remains opaque. As a result, the
organization of the pretraining corpus is still empirical and may deviate from
the optimal. To address this issue, we systematically analyze the impact of 48
datasets from 5 major categories of pretraining data of LLMs and measure their
impacts on LLMs using benchmarks about nine major categories of model
capabilities. Our analyses provide empirical results about the contribution of
multiple corpora on the performances of LLMs, along with their joint impact
patterns, including complementary, orthogonal, and correlational relationships.
We also identify a set of ``high-impact data'' such as Books that is
significantly related to a set of model capabilities. These findings provide
insights into the organization of data to support more efficient pretraining of
LLMs.

通过对 LLMs 的 48 个数据集进行系统分析，我们测量了它们对 LLMs 的性能的影响，并研究了它们之间的相关关系，从而为更有效的 LLMs 预训练提供了洞见。

通过机器学习去除预训练数据对大型语言模型的影响解析

Deciphering the lmpact of Pretraining Data on Large Language Models  through Machine Unlearning

Many recent studies on large-scale language models have reported successful
in-context zero- and few-shot learning ability. However, the in-depth analysis
of when in-context learning occurs is still lacking. For example, it is unknown
how in-context learning performance changes as the training corpus varies.
Here, we investigate the effects of the source and size of the pretraining
corpus on in-context learning in HyperCLOVA, a Korean-centric GPT-3 model. From
our in-depth investigation, we introduce the following observations: (1)
in-context learning performance heavily depends on the corpus domain source,
and the size of the pretraining corpus does not necessarily determine the
emergence of in-context learning, (2) in-context learning ability can emerge
when a language model is trained on a combination of multiple corpora, even
when each corpus does not result in in-context learning on its own, (3)
pretraining with a corpus related to a downstream task does not always
guarantee the competitive in-context learning performance of the downstream
task, especially in the few-shot setting, and (4) the relationship between
language modeling (measured in perplexity) and in-context learning does not
always correlate: e.g., low perplexity does not always imply high in-context
few-shot learning performance.

研究了韩国中心型 GPT-3 模型 HyperCLOVA 中的上下文零样本和少样本学习，发现性能主要取决于语料库域源和预训练语料库的大小，可以通过组合多个语料库预先训练获得上下文学习能力.