BriefGPT.xyz
Dec, 2023
Spike No More: Stabilizing the Pre-training of Large Language Models
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
TL;DR
We investigate the causes of gradient explosion during the pre-training of large language models, and propose an initialization method that satisfies conditions for preventing gradient explosion together with a simple modification to the embeddings. Experiments show that this combination effectively prevents loss spikes during pre-training.
Abstract
The loss spike often occurs during pre-training of a large language model. The spikes degrade the performance of a large language model, a …
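The summary above describes two ingredients: an initialization whose scale satisfies conditions that prevent gradient explosion, and a simple modification to the embeddings. A minimal sketch of what such a recipe could look like is below; the exact standard deviation and the square-root scaling factor are assumptions for illustration, since this page does not spell out the paper's formulas.

```python
import numpy as np

def small_init(shape, d_model, rng):
    """Width-dependent "small" initialization (assumed form).

    The standard deviation shrinks as the model dimension grows, so
    sublayer outputs stay small at the start of pre-training.
    """
    std = (2.0 / (5.0 * d_model)) ** 0.5  # assumed scaling rule
    return rng.normal(0.0, std, size=shape)

def scaled_embed(token_ids, embedding_table, d_model):
    """Scale embedding outputs up by sqrt(d_model) (assumed factor).

    With a small initialization, raw embedding vectors have tiny norms;
    multiplying by sqrt(d_model) keeps their magnitude comparable to
    the residual stream they feed into.
    """
    return embedding_table[token_ids] * np.sqrt(d_model)

rng = np.random.default_rng(0)
d_model = 256
embedding_table = small_init((1000, d_model), d_model, rng)
hidden = scaled_embed(np.array([3, 7, 42]), embedding_table, d_model)
```

This is only a sketch of the high-level idea summarized here, not the paper's verified procedure; consult the full paper for the actual conditions and constants.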