Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan...
TL;DR: Pretraining data, language models, data mixing laws, model performance, and data schedules
Abstract
Pretraining data of large language models comprises multiple domains (e.g.,
web text, academic papers, code), whose mixture proportions crucially affect
the competence of the resulting models. While existing endeavors