In this work, we investigate the effect of momentum on the optimisation
trajectory of gradient descent. Analysing momentum gradient descent with step
size $\gamma$ and momentum parameter $\beta$ in continuous time, we identify
an intrinsic quantity $\lambda = \frac{\gamma}{(1 - \beta)^2}$ which uniquely
defines the optimisation path and provides a simple acceleration rule. When
training a $2$-layer diagonal
linear network in an overparametrised regression setting, we characterise the
recovered solution through an implicit regularisation problem. We then prove
that small values of $\lambda$ help to recover sparse solutions. Finally, we
give similar but weaker results for stochastic momentum gradient descent. We
provide numerical experiments which support our claims.
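
As a rough illustration of the role of $\lambda$, the sketch below runs momentum gradient descent on a toy quadratic loss and compares trajectories at matched continuous times. Several choices here are assumptions made for illustration and are not taken from the text: the heavy-ball form of the update, $w_{k+1} = w_k - \gamma \nabla L(w_k) + \beta (w_k - w_{k-1})$, the time elapsed per iteration, $\gamma / (1 - \beta)$, which comes from the usual continuous-time scaling of that update, the quadratic loss itself (not a diagonal linear network), and all helper names.

```python
import numpy as np

# Toy problem (assumption): L(w) = 0.5 * sum(a * w**2), not the paper's network.
A_DIAG = np.array([1.0, 4.0])
W0 = np.array([1.0, 1.0])

def heavy_ball(gamma, beta, n_steps):
    """Heavy-ball momentum GD (assumed update form), started with zero velocity."""
    w_prev, w = W0.copy(), W0.copy()
    path = [w.copy()]
    for _ in range(n_steps):
        grad = A_DIAG * w
        w, w_prev = w - gamma * grad + beta * (w - w_prev), w
        path.append(w.copy())
    return np.array(path)

def beta_for(gamma, lam):
    """Momentum parameter giving gamma / (1 - beta)**2 == lam."""
    return 1.0 - np.sqrt(gamma / lam)

T = 5.0      # final continuous time
DT = 0.025   # common time grid on which paths are compared

def sampled_path(gamma, beta):
    """Iterates resampled on the common grid, assuming gamma / (1 - beta) time per step."""
    eps = gamma / (1.0 - beta)
    stride = int(round(DT / eps))
    n_steps = int(round(T / eps))
    return heavy_ball(gamma, beta, n_steps)[::stride]

lam = 0.5
gamma_a, gamma_b = 0.00125, 0.0003125
beta_a, beta_b = beta_for(gamma_a, lam), beta_for(gamma_b, lam)

path_a = sampled_path(gamma_a, beta_a)   # lambda = 0.5, larger step size
path_b = sampled_path(gamma_b, beta_b)   # lambda = 0.5, smaller step size
path_c = sampled_path(gamma_a, 0.5)      # same step size as path_a, lambda = 0.005

gap_same = np.max(np.abs(path_a - path_b))   # same lambda
gap_diff = np.max(np.abs(path_a - path_c))   # different lambda

print(f"max gap, same lambda:      {gap_same:.4f}")
print(f"max gap, different lambda: {gap_diff:.4f}")
```

In this toy setting the two runs sharing $\lambda$ stay close at matched times, with the larger step size needing a quarter as many iterations to reach time `T`, while the run with a different $\lambda$ follows a visibly different path.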
