We find that existing language modeling datasets contain many near-duplicate
examples and long repetitive substrings. As a result, over 1% of the unprompted
output of language models trained on these datasets is copied verbatim from the
training data. We develop two tools that allow us to deduplicate training
datasets, for example by removing from C4 a single 61-word English sentence
that is repeated over 60,000 times. Deduplication allows us to train models that
emit memorized text ten times less frequently and require fewer training steps
to achieve the same or better accuracy. We can also reduce train-test overlap,
which affects over 4% of the validation set of standard datasets, thus allowing
for more accurate evaluation. We release code for reproducing our work and
performing dataset deduplication at
this https URL
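
To make the two kinds of redundancy named above concrete (verbatim repeats and near-duplicate examples), here is a minimal sketch in Python. It is not the released tooling: the function names, the whitespace tokenization, the word-trigram shingles, and the 0.7 Jaccard threshold are all assumptions chosen for illustration, and the pairwise comparison is quadratic, so it only suits small samples.

```python
from collections import Counter


def shingles(text, n=3):
    """Return the set of word n-grams in `text` (naive whitespace tokenization)."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def exact_duplicate_counts(examples):
    """Map each example that occurs more than once to its number of occurrences."""
    counts = Counter(examples)
    return {text: count for text, count in counts.items() if count > 1}


def near_duplicate_pairs(examples, threshold=0.7, n=3):
    """Return index pairs (i, j), i < j, whose shingle sets have Jaccard >= threshold.

    Hypothetical helper: compares every pair, so cost grows quadratically.
    """
    sets = [shingles(text, n) for text in examples]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The same `shingles` and `jaccard` helpers could be applied across a training split and a validation split to estimate train-test overlap, again only as a rough illustration at small scale.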

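Existing language modeling datasets contain many near-duplicate examples and long repeated substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied directly from the training data. We develop two tools that locate redundant data in training datasets, so that models can be trained on deduplicated data and emit memorized text less often. Deduplication also reduces train-test overlap, which allows for more accurate evaluation. We release our work and code at the specified https URL.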