Language models have become a critical technology for tackling a wide range of
natural language processing tasks, yet many details about how the
best-performing language models were developed are not reported. In particular,
information about their pretraining corpora is seldom discussed: commercial
language models rarely provide any information about their data; even open
models rarely release datasets they are trained on, or an exact recipe to
reproduce them. As a result, it is challenging to conduct certain threads of
language modeling research, such as understanding how training data impacts
model capabilities and shapes their limitations. To facilitate open research on
language model pretraining, we release Dolma, a three-trillion-token English
corpus, built from a diverse mixture of web content, scientific papers, code,
public-domain books, social media, and encyclopedic materials. In addition, we
open-source our data curation toolkit to enable further experimentation and
reproduction of our work. In this report, we document Dolma, including its
design principles, details about its construction, and a summary of its
contents. We interleave this report with analyses and experimental results from
training language models on intermediate states of Dolma to share what we have
learned about important data curation practices, including the role of content
or quality filters, deduplication, and multi-source mixing. Dolma has been used
to train OLMo, a state-of-the-art, open language model and framework designed
to build and study the science of language modeling.
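
To make the curation steps named above concrete, the following is a minimal Python sketch of a heuristic quality filter, exact deduplication, and multi-source mixing. It is not the Dolma toolkit's API: the function names (`normalize`, `passes_quality_filter`, `curate`), the `min_words` threshold, and the choice of SHA-256 over whitespace-normalized text are all assumptions made for illustration.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def passes_quality_filter(text, min_words=5):
    """Hypothetical heuristic filter: keep documents with at least min_words words."""
    return len(text.split()) >= min_words


def curate(sources, min_words=5):
    """Filter, exactly deduplicate, and mix documents from several sources.

    sources maps a source name (e.g. "web", "code") to a list of document strings.
    Returns a list of {"source": ..., "text": ...} records in input order.
    """
    seen = set()   # digests of documents already kept, shared across sources
    mixed = []
    for source_name, docs in sources.items():
        for text in docs:
            if not passes_quality_filter(text, min_words):
                continue
            digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate after normalization
            seen.add(digest)
            mixed.append({"source": source_name, "text": text})
    return mixed
```

Calling `curate({"web": [...], "code": [...]})` returns one record per surviving document. A real pipeline would likely use stronger filters and near-duplicate detection; this sketch only shows how the three steps compose.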

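We release Dolma, a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We also open-source our data curation toolkit to enable further experimentation and reproduction of our work. The report describes Dolma's design principles, construction details, and a summary of its contents, interleaved with analyses and experimental results from training language models on intermediate states of Dolma. These share what we have learned about important data curation practices, including the role of content or quality filters, deduplication, and multi-source mixing. Dolma has been used to train OLMo, a state-of-the-art open language model and framework designed to build and study the science of language modeling.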

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Large volumes of text data have contributed significantly to the development
of large language models (LLMs) in recent years. This data is typically
acquired by scraping the internet, leading to pretraining datasets composed of
noisy web text. To date, efforts to prune these datasets down to a
higher-quality subset have relied on hand-crafted heuristics encoded as rule-based
filters. In this work, we take a wider view and explore scalable estimates of
data quality that can be used to systematically measure the quality of
pretraining data. We perform a rigorous comparison at scale of a simple data
quality estimator, perplexity, against two more sophisticated and
computationally intensive estimators, the Error L2-Norm and memorization.
These metrics are used to rank and prune pretraining corpora, and we
subsequently compare LLMs trained on these pruned datasets. Surprisingly, we
find that the simple technique of perplexity outperforms our more
computationally expensive scoring methods. We improve over our no-pruning
baseline while training on as little as 30% of the original training dataset.
Our work sets the foundation for unexplored strategies in automatically
curating high-quality corpora and suggests that the majority of pretraining data can
be removed while retaining performance.
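
The rank-and-prune procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: an add-one-smoothed unigram model fitted on the corpus itself stands in for a reference language model, the names `unigram_perplexity` and `prune_by_perplexity` are hypothetical, and keeping the lowest-perplexity documents is an assumption, since the abstract does not say which part of the ranking is retained.

```python
import math
from collections import Counter


def unigram_perplexity(doc_tokens, counts, total, vocab_size):
    """Perplexity of one document under an add-one-smoothed unigram model."""
    if not doc_tokens:
        return float("inf")  # empty documents rank last
    log_prob = 0.0
    for tok in doc_tokens:
        p = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(doc_tokens))


def prune_by_perplexity(docs, keep_fraction=0.3):
    """Score every document, rank by perplexity, and keep a fraction.

    Assumption: the lowest-perplexity documents are kept. The unigram model
    is a stand-in for a trained reference language model.
    """
    tokenized = [d.lower().split() for d in docs]
    counts = Counter(tok for toks in tokenized for tok in toks)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens
    scored = sorted(
        (unigram_perplexity(toks, counts, total, vocab_size), i)
        for i, toks in enumerate(tokenized)
    )
    n_keep = max(1, math.ceil(keep_fraction * len(docs)))
    kept = sorted(i for _, i in scored[:n_keep])  # restore original order
    return [docs[i] for i in kept]
```

With `keep_fraction=0.3` the function retains roughly 30% of the input, mirroring the pruning ratio mentioned in the abstract. Swapping in a different scoring function (for example, one based on a trained model) only requires replacing `unigram_perplexity`.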

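By comparing perplexity, a simple estimator of data quality, against the more sophisticated and computationally intensive Error L2-Norm and memorization estimators, we find that perplexity is effective at removing noise from pretraining data and improving dataset quality. We improve over our baseline while training on only 30% of the original training data, which offers a new methodology for automatically curating high-quality datasets and suggests that most pretraining data can be removed while retaining performance.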

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Massively multilingual transformers pretrained with language modeling
objectives (e.g., mBERT, XLM-R) have become the de facto default
paradigm for zero-shot cross-lingual transfer in NLP, offering unmatched
transfer performance. Current downstream evaluations, however, verify their
efficacy predominantly in transfer settings involving languages with sufficient
amounts of pretraining data, and with lexically and typologically close
languages. In this work, we analyze their limitations and show that
cross-lingual transfer via massively multilingual transformers, much like
transfer via cross-lingual word embeddings, is substantially less effective in
resource-lean scenarios and for distant languages. Our experiments,
encompassing three lower-level tasks (POS tagging, dependency parsing, NER), as
well as two high-level semantic tasks (NLI, QA), empirically correlate transfer
performance with linguistic similarity between the source and target languages,
but also with the size of pretraining corpora of target languages. We also
demonstrate the surprising effectiveness of inexpensive few-shot transfer (i.e.,
fine-tuning on a few target-language instances after fine-tuning in the source)
across the board. This suggests that additional research efforts should be
invested to reach beyond the limiting zero-shot conditions.

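This work analyzes the limitations of massively multilingual transformers in zero-shot cross-lingual settings and shows that cross-lingual transfer via these models is substantially less effective in resource-lean scenarios and for distant languages. Experiments on several lower-level and higher-level NLP tasks establish a correlation between transfer performance and both the linguistic similarity of the source and target languages and the size of the target language's pretraining corpus. In addition, few-shot transfer, i.e., fine-tuning on the source language and then on a few target-language instances, proves highly effective.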