Feb, 2024
Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning
Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Zhouhao Sun...
TL;DR
Through a systematic analysis of 48 datasets used in LLM pretraining, we measure their impact on LLM performance and study the correlations among them, offering insights for more effective LLM pretraining.
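A minimal sketch of the general idea behind unlearning-based attribution, not the paper's exact procedure: approximately "unlearn" one pretraining corpus component via gradient ascent on its language-modeling loss, then compare a capability score before and after. The model name, the `unlearn` hyperparameters, and the `evaluate_capability` helper are illustrative placeholders, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

def unlearn(texts, steps=100):
    """Ascend the LM loss on `texts` so the model 'forgets' that corpus slice."""
    model.train()
    for _, text in zip(range(steps), texts):
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        loss = model(**batch, labels=batch["input_ids"]).loss
        (-loss).backward()           # gradient ascent = descent on the negative loss
        optimizer.step()
        optimizer.zero_grad()

def evaluate_capability(model) -> float:
    """Placeholder: return a benchmark score (e.g., accuracy on a held-out task)."""
    raise NotImplementedError

# score_before = evaluate_capability(model)
# unlearn(corpus_component_texts)          # e.g., one of the 48 dataset slices
# score_after = evaluate_capability(model)
# impact = score_before - score_after      # larger drop => greater contribution
```

The attraction of this framing is that it probes a component's contribution without retraining from scratch on an ablated corpus, which is what makes a study across dozens of datasets tractable.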
Abstract
Through pretraining on a corpus with various sources, large language models (LLMs) have gained impressive performance. However, the impact of each component of the pretraining corpus remains opaque. As a result, …