Diffusion Models (DM) and Consistency Models (CM) are two popular families of
generative models with good generation quality on various tasks. When training
DM and CM, intermediate weight checkpoints are not fully utilized and only the
last converged checkpoint is used. In this work, we find that high-quality
model weights often lie in a basin which cannot be reached by SGD but can be
obtained by proper checkpoint averaging. Based on these observations, we
propose LCSC, a simple but effective and efficient method to enhance the
performance of DM and CM, by combining checkpoints along the training
trajectory with coefficients deduced from evolutionary search. We demonstrate
the value of LCSC through two use cases: $\textbf{(a) Reducing training cost.}$
With LCSC, we only need to train DM/CM for fewer iterations and/or with
smaller batch sizes to obtain sample quality comparable to that of the fully
trained model. For example, LCSC achieves considerable training speedups for CM
(23$\times$ on CIFAR-10 and 15$\times$ on ImageNet-64). $\textbf{(b) Enhancing
pre-trained models.}$ Assuming full training is already done, LCSC can further
improve the generation quality or speed of the final converged models. For
example, with a number of function evaluations (NFE) of 1, LCSC outperforms
the base model with 2 NFE on consistency distillation, and decreases
the NFE of DM from 15 to 9 while maintaining the generation quality on
CIFAR-10. Our code is available at
this https URL
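
The core idea above, a linear combination of saved checkpoints whose coefficients come from an evolutionary search, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: checkpoints are flat lists of floats, `loss_fn` stands in for whatever quality metric the search optimizes, the coefficients are constrained to sum to 1, and the search is a plain mutate-and-select loop. The names `combine`, `project_to_sum_one`, and `evolutionary_search` are hypothetical.

```python
import random


def combine(checkpoints, coeffs):
    """Return the linear combination sum_i coeffs[i] * checkpoints[i]."""
    dim = len(checkpoints[0])
    return [
        sum(c * ckpt[j] for c, ckpt in zip(coeffs, checkpoints))
        for j in range(dim)
    ]


def project_to_sum_one(coeffs):
    """Shift coefficients so they sum to 1 (a constraint assumed by this sketch)."""
    shift = (1.0 - sum(coeffs)) / len(coeffs)
    return [c + shift for c in coeffs]


def evolutionary_search(checkpoints, loss_fn, generations=200,
                        population=16, sigma=0.1, seed=0):
    """Mutate-and-select search over combination coefficients (illustrative only)."""
    rng = random.Random(seed)
    n = len(checkpoints)
    # Start from the usual baseline: use only the last checkpoint.
    best = [0.0] * (n - 1) + [1.0]
    best_loss = loss_fn(combine(checkpoints, best))
    for _ in range(generations):
        # Mutate the current best with Gaussian noise, then re-project.
        children = [
            project_to_sum_one([c + rng.gauss(0.0, sigma) for c in best])
            for _ in range(population)
        ]
        scored = [(loss_fn(combine(checkpoints, ch)), ch) for ch in children]
        child_loss, child = min(scored, key=lambda pair: pair[0])
        # Keep the best child only if it improves on the parent.
        if child_loss < best_loss:
            best, best_loss = child, child_loss
    return best, best_loss


if __name__ == "__main__":
    # Toy "checkpoints": flat weight vectors from a hypothetical training run.
    ckpts = [[0.0, 2.0], [1.0, 0.5], [2.0, 0.0]]
    # Toy loss: squared distance to a point that no single checkpoint hits.
    toy_loss = lambda w: (w[0] - 1.0) ** 2 + (w[1] - 1.0) ** 2
    coeffs, loss = evolutionary_search(ckpts, toy_loss)
    print("coefficients:", coeffs)
    print("loss:", loss, "vs last checkpoint:", toy_loss(ckpts[-1]))
```

Because the search starts at the last checkpoint and only accepts improvements, the combined weights in this sketch can never score worse on `loss_fn` than the last checkpoint alone. Shifting rather than rescaling keeps the sum-to-one constraint well defined even when some coefficients turn negative.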

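With suitable checkpoint-averaging coefficients, LCSC improves the performance of DM and CM by combining checkpoints along the training trajectory, reducing training cost and improving the generation quality of pre-trained models.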

Linear combinations of saved checkpoints improve the performance of consistency and diffusion models

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

The conventional recipe for Automatic Speech Recognition (ASR) models is to
1) train multiple checkpoints on a training set while relying on a validation
set to prevent overfitting via early stopping and 2) average the last several
checkpoints or those with the lowest validation losses to obtain the final model.
In this paper, we rethink and update early stopping and checkpoint
averaging from the perspective of the bias-variance tradeoff. Theoretically,
the bias and variance represent the fitness and variability of a model, and the
tradeoff between them determines the overall generalization error. However, it
is impractical to evaluate them precisely. As an alternative, we take the training
loss and validation loss as proxies of bias and variance and guide the early
stopping and checkpoint averaging using their tradeoff, namely an Approximated
Bias-Variance Tradeoff (ApproBiVT). When evaluated with advanced ASR models,
our recipe provides 2.5%-3.7% and 3.1%-4.6% CER reductions on AISHELL-1 and
AISHELL-2, respectively.
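
A rough sketch of how such a recipe might look in code follows. The abstract does not state how the two losses are combined, so the score used here (training loss plus validation loss) is purely an assumption, as are the patience-based stopping rule, the flat-list checkpoint representation, and all function names.

```python
def proxy_scores(train_losses, val_losses):
    """Assumed tradeoff score per checkpoint: training loss + validation loss."""
    return [t + v for t, v in zip(train_losses, val_losses)]


def should_stop(scores, patience):
    """Stop once the best score is at least `patience` checkpoints old."""
    best_index = min(range(len(scores)), key=lambda i: scores[i])
    return len(scores) - 1 - best_index >= patience


def select_checkpoints(scores, k):
    """Indices of the k checkpoints with the lowest score, in training order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:k])


def average_checkpoints(checkpoints, indices):
    """Elementwise mean of the selected checkpoints (flat weight lists)."""
    dim = len(checkpoints[0])
    return [
        sum(checkpoints[i][j] for i in indices) / len(indices)
        for j in range(dim)
    ]


if __name__ == "__main__":
    # Hypothetical per-checkpoint losses: training keeps falling while
    # validation bottoms out and then rises.
    train = [2.0, 1.5, 1.0, 0.8, 0.5, 0.3]
    val = [2.2, 1.8, 1.4, 1.3, 1.5, 2.3]
    scores = proxy_scores(train, val)
    chosen = select_checkpoints(scores, k=3)
    # Toy checkpoints, one flat weight vector per entry.
    ckpts = [[float(i), 2.0 * i] for i in range(len(train))]
    print("stop now:", should_stop(scores, patience=1))
    print("chosen:", chosen)
    print("averaged:", average_checkpoints(ckpts, chosen))
```

In this toy run the validation loss alone is lowest at checkpoint 3, while the combined score is lowest at checkpoint 4, which shows how adding the training loss can shift both the stopping point and the set of checkpoints being averaged.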

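In this paper, we rethink and update early stopping and checkpoint averaging from the perspective of the bias-variance tradeoff, using the training loss and validation loss as approximate proxies for bias and variance. When evaluated with advanced ASR models, our method reduces CER by 2.5%-3.7% on AISHELL-1 and 3.1%-4.6% on AISHELL-2.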

ApproBiVT: leading ASR models to generalize better with early stopping and checkpoint averaging guided by an approximated bias-variance tradeoff

ApproBiVT: Lead ASR Models to Generalize Better Using Approximated Bias-Variance Tradeoff Guided Early Stopping and Checkpoint Averaging

Training LLMs is expensive, and recent evidence indicates training all the
way to convergence is inefficient. In this paper, we investigate the ability of
a simple idea, checkpoint averaging along the trajectory of a training run, to
improve the quality of models before they have converged. This approach incurs
no extra cost during training or inference. Specifically, we analyze the
training trajectories of Pythia LLMs with 1 to 12 billion parameters and
demonstrate that, particularly during the early to mid stages of training, this
idea accelerates convergence and improves both test and zero-shot
generalization. Loss spikes are a well-recognized problem in LLM training; in
our analysis we encountered two instances of this in the underlying
trajectories, and both instances were mitigated by our averaging.
For a 6.9B parameter LLM, for example, our early weight averaging recipe can
save up to 4200 hours of GPU time, which corresponds to significant savings in
cloud compute costs.
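
Averaging along a training trajectory can be sketched as a trailing-window mean over saved checkpoints. The window size, the flat-list checkpoint representation, and the name `trailing_average` are assumptions of this illustration; the abstract does not specify how many checkpoints are averaged or how they are spaced.

```python
def trailing_average(checkpoints, window):
    """For each step t, average the last `window` checkpoints up to and including t.

    Early steps with fewer than `window` checkpoints average what is available.
    """
    averaged = []
    for t in range(len(checkpoints)):
        recent = checkpoints[max(0, t - window + 1): t + 1]
        dim = len(recent[0])
        averaged.append(
            [sum(ckpt[j] for ckpt in recent) / len(recent) for j in range(dim)]
        )
    return averaged


if __name__ == "__main__":
    # Toy trajectory: a single weight oscillating around 1.0, standing in
    # for the noise of an unconverged training run.
    trajectory = [[1.5], [0.5], [1.5], [0.5]]
    for raw, avg in zip(trajectory, trailing_average(trajectory, window=2)):
        print("raw:", raw, "averaged:", avg)
```

The averaging runs offline on checkpoints that were saved anyway, which is consistent with the abstract's point that the approach adds no cost during training or inference.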

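Checkpoint averaging improves the quality of large language models (LLMs), shortening training time and improving test and zero-shot generalization at no extra training or inference cost.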

Understanding the effectiveness of early weight averaging for training large language models

Understanding the Effectiveness of Early Weight Averaging for Training Large Language Models

Checkpoint averaging is a simple and effective method to boost the
performance of converged neural machine translation models. The calculation is
cheap to perform, and because the translation improvement comes almost for
free, the method is widely adopted in neural machine translation research.
Despite its popularity, the method itself simply takes the mean of the model
parameters from several checkpoints, the selection of which is mostly based on
empirical recipes without much justification. In this work, we revisit the concept of
checkpoint averaging and consider several extensions. Specifically, we
experiment with ideas such as using different checkpoint selection strategies,
calculating a weighted average instead of a simple mean, making use of gradient
information, and fine-tuning the interpolation weights on development data. Our
results confirm the necessity of applying checkpoint averaging for optimal
performance, but also suggest that the landscape between the converged
checkpoints is rather flat and that little further improvement over simple
averaging is to be obtained.
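
Tuning an interpolation weight on development data can be sketched for two checkpoints with a plain grid search. This is only an illustration under stated assumptions: `dev_loss` is a hypothetical callable returning a development-set loss for a flat weight list, the grid resolution is arbitrary, and the abstract does not say the authors tuned their weights this way.

```python
def interpolate(theta_a, theta_b, alpha):
    """Weighted average (1 - alpha) * theta_a + alpha * theta_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def tune_alpha(theta_a, theta_b, dev_loss, steps=10):
    """Grid-search the interpolation weight that minimizes the dev loss."""
    grid = [i / steps for i in range(steps + 1)]
    losses = [dev_loss(interpolate(theta_a, theta_b, alpha)) for alpha in grid]
    best = min(range(len(grid)), key=lambda i: losses[i])
    return grid[best], losses[best]


if __name__ == "__main__":
    # Two toy checkpoints and a toy dev loss with its minimum between them.
    theta_a, theta_b = [0.0], [2.0]
    toy_dev_loss = lambda w: (w[0] - 1.2) ** 2
    alpha, tuned = tune_alpha(theta_a, theta_b, toy_dev_loss)
    mean = toy_dev_loss(interpolate(theta_a, theta_b, 0.5))
    print("tuned alpha:", alpha, "tuned loss:", tuned)
    print("simple-mean loss:", mean)
```

Setting `alpha = 0.5` recovers the simple mean, so comparing the tuned loss with the simple-mean loss on real development data gives a direct read on how flat the region between two checkpoints is.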

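In neural machine translation, checkpoint averaging is used to improve model performance; the method is cheap to compute and widely adopted. This paper experiments with different checkpoint selection strategies, weighted averaging, and the use of gradient information. The results show that checkpoint averaging is necessary for the best performance, but that little room for further improvement remains once the checkpoints have converged.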