BriefGPT.xyz
May, 2025
When Bad Data Leads to Good Models
Kenneth Li, Yida Chen, Fernanda Viégas, Martin Wattenberg
TL;DR
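This study re-examines data quality in large language model pretraining and proposes that including more toxic data during pretraining may help reduce output toxicity after post-training. Experiments show that although toxic data increases the model's generation toxicity, it also makes that toxicity easier to remove, yielding a better trade-off between reducing toxicity and preserving model capabilities.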
Abstract
In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Speci