Although gradient descent with momentum is widely used in modern deep
learning, a concrete understanding of its effects on the training trajectory
still remains elusive. In this work, we empirically show that momentum gradient
descent with a large learning rate and learning rate warmup displays large
catapults, driving the iterates towards flatter minima than those found by
gradient descent. We then provide empirical evidence and theoretical intuition
that the large catapult is caused by momentum "amplifying" the
self-stabilization effect (Damian et al., 2023).

通过实证研究，我们发现使用较大学习速率和学习速率预热的动量梯度下降会产生大的弹射效应，将迭代点推向更平坦的最小值，我们提供了实证证据和理论解释表明这种弹射效应是由于动量 “放大” 了自稳定效应。

动量梯度下降中的大型弹射器研究

Large Catapults in Momentum Gradient Descent with Warmup: An Empirical  Study

Dataset distillation is the task of synthesizing a small dataset such that a
model trained on the synthetic set will match the test accuracy of the model
trained on the full dataset. In this paper, we propose a new formulation that
optimizes our distilled data to guide networks to a similar state as those
trained on real data across many training steps. Given a network, we train it
for several iterations on our distilled data and optimize the distilled data
with respect to the distance between the synthetically trained parameters and
the parameters trained on real data. To efficiently obtain the initial and
target network parameters for large-scale datasets, we pre-compute and store
training trajectories of expert networks trained on the real dataset. Our
method handily outperforms existing methods and also allows us to distill
higher-resolution visual data.

本研究提供了一种新的算法，使用合成数据集优化网络，可以快速、高效地将神经网络训练到与真实数据相似的状态，从而实现数据集精简化处理，并能够处理高分辨率视觉数据。