In this work, we instantiate a regularized form of the gradient clipping
algorithm and prove that it can converge to the global minima of deep neural
network loss functions provided that the net is of sufficient width. We present
empirical evidence that our theoretically founded regularized gradient clipping
algorithm is also competitive with the state-of-the-art deep-learning
heuristics. Hence the algorithm presented here constitutes a new approach to
rigorous deep learning.
The modification we make to standard gradient clipping is designed to leverage
the PL* condition, a variant of the Polyak-Łojasiewicz inequality which was
recently proven to hold for various neural networks of any depth within a
neighborhood of the initialization.
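
The abstract does not spell out the modification, so the following is only a minimal sketch under a stated assumption: the usual clipping factor `min(1, gamma / ||grad||)` is floored at a small constant `delta`, so the effective step size never collapses to zero when the gradient is very large. The names `regularized_clip_step`, `eta`, `gamma`, and `delta`, and the toy quadratic loss, are hypothetical and chosen for illustration only.

```python
import math

def regularized_clip_step(w, grad, eta=0.1, gamma=1.0, delta=0.01):
    """One update of an ASSUMED delta-regularized clipping rule.

    Standard clipping scales the gradient by min(1, gamma / ||grad||).
    The assumed regularization floors that factor at delta.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        scale = 1.0  # nothing to clip at a stationary point
    else:
        scale = min(1.0, max(delta, gamma / norm))
    return [wi - eta * scale * gi for wi, gi in zip(w, grad)]

def loss(w):
    # Toy quadratic loss, used only to exercise the update rule.
    return 0.5 * sum(wi * wi for wi in w)

def grad_loss(w):
    # Gradient of the toy quadratic loss.
    return list(w)

w = [100.0, -50.0]
for _ in range(2000):
    w = regularized_clip_step(w, grad_loss(w))
final_loss = loss(w)
```

On this toy problem the iterates pass through three regimes: the floored regime while the gradient norm exceeds `gamma / delta`, the clipped regime with constant step length, and plain gradient descent once the gradient norm drops below `gamma`.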

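We prove that a regularized gradient clipping algorithm converges to the global minima of deep neural network loss functions, provided the network is sufficiently wide, and we show empirically that the algorithm is competitive with existing deep-learning heuristics. The algorithm therefore constitutes a new approach to rigorous deep learning.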

Regularized gradient clipping reliably trains wide and deep neural networks

Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

We prove non-asymptotic error bounds for particle gradient descent
(PGD) (Kuntz et al., 2023), a recently introduced algorithm for maximum
likelihood estimation of large latent variable models obtained by discretizing
a gradient flow of the free energy. We begin by showing that, for models
satisfying a condition generalizing both the log-Sobolev and the
Polyak-Łojasiewicz inequalities (LSI and PŁI, respectively), the flow
converges exponentially fast to the set of minimizers of the free energy. We
achieve this by extending a result well-known in the optimal transport
literature (that the LSI implies the Talagrand inequality) and its counterpart
in the optimization literature (that the PŁI implies the so-called quadratic
growth condition), and applying it to our new setting. We also generalize the
Bakry-Émery Theorem and show that the LSI/PŁI generalization holds for
models with strongly concave log-likelihoods. For such models, we further
control PGD's discretization error, obtaining non-asymptotic error bounds.
While we are motivated by the study of PGD, we believe that the inequalities
and results we extend may be of independent interest.
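
For orientation, the two classical implications mentioned above can be written out in their standard textbook forms. The notation here is an assumption, not taken from the abstract: $f^*$ is the minimum of $f$, $X^*$ its set of minimizers, $\pi$ a reference measure, $\mathrm{KL}$ the relative entropy, $I$ the relative Fisher information, and $W_2$ the 2-Wasserstein distance. The generalized condition studied in the paper itself is not reproduced here.

```latex
% Polyak-Lojasiewicz inequality with constant \mu > 0 ...
\frac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr)
\quad \text{for all } x
% ... implies quadratic growth:
\quad\Longrightarrow\quad
f(x) - f^{*} \;\ge\; \frac{\mu}{2}\,\operatorname{dist}(x, X^{*})^{2}.

% Log-Sobolev inequality with constant \lambda > 0 ...
\mathrm{KL}(\nu \,\|\, \pi) \;\le\; \frac{1}{2\lambda}\, I(\nu \,\|\, \pi)
\quad \text{for all } \nu
% ... implies the Talagrand inequality:
\quad\Longrightarrow\quad
W_{2}^{2}(\nu, \pi) \;\le\; \frac{2}{\lambda}\,\mathrm{KL}(\nu \,\|\, \pi).
```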

Error bounds for particle gradient descent and generalizations of the log-Sobolev and Talagrand inequalities

Error bounds for particle gradient descent, and extensions of the log-Sobolev and Talagrand inequalities

We consider the momentum stochastic gradient descent scheme (MSGD) and its
continuous-in-time counterpart in the context of non-convex optimization. We
show almost sure exponential convergence of the objective function value for
target functions that are Lipschitz continuous and satisfy the
Polyak-Lojasiewicz inequality on the relevant domain, and under assumptions on
the stochastic noise that are motivated by overparameterized supervised
learning applications. Moreover, we optimize the convergence rate over the set
of friction parameters and show that the MSGD process almost surely converges.

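This paper studies the momentum stochastic gradient descent (MSGD) scheme and its continuous-in-time counterpart in non-convex optimization. It proves that, when the target function is Lipschitz continuous and satisfies the Polyak-Lojasiewicz inequality, the objective function value converges exponentially almost surely. It also optimizes the convergence rate over the friction parameters and shows that the MSGD process converges almost surely.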

On the convergence rate of momentum stochastic gradient descent with noise in machine learning

Convergence rates for momentum stochastic gradient descent with noise of machine learning type

Nonconvex minimax problems appear frequently in emerging machine learning
applications, such as generative adversarial networks and adversarial learning.
Simple algorithms such as gradient descent ascent (GDA) are common
practice for solving these nonconvex games and enjoy considerable empirical
success. Yet, it is known that these vanilla GDA algorithms with constant step
size can potentially diverge even in the convex setting. In this work, we show
that for a subclass of nonconvex-nonconcave objectives satisfying a so-called
two-sided Polyak-Łojasiewicz inequality, the alternating gradient descent
ascent (AGDA) algorithm converges globally at a linear rate and the stochastic
AGDA achieves a sublinear rate. We further develop a variance reduced algorithm
that attains a provably faster rate than AGDA when the problem has the
finite-sum structure.
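
A minimal sketch of the alternating update may help fix ideas. It assumes the usual form of AGDA, in which the descent step on `x` is taken first and the ascent step on `y` then uses the updated `x`. The objective f(x, y) = 0.5 x^2 + x y - 0.5 y^2, the step sizes `tau_x` and `tau_y`, and the function names are all hypothetical choices for illustration. This objective is strongly convex in `x` and strongly concave in `y`, so it is only a simple special case of the two-sided PL setting, not a nonconvex-nonconcave example.

```python
def grad_x(x, y):
    # Partial derivative of f(x, y) = 0.5*x**2 + x*y - 0.5*y**2 in x.
    return x + y

def grad_y(x, y):
    # Partial derivative of the same f in y.
    return x - y

def agda(x, y, tau_x=0.1, tau_y=0.1, steps=500):
    """Alternating gradient descent ascent with constant step sizes."""
    for _ in range(steps):
        x = x - tau_x * grad_x(x, y)  # descent step on the min variable
        y = y + tau_y * grad_y(x, y)  # ascent step using the updated x
    return x, y

x_final, y_final = agda(3.0, -2.0)
```

The only saddle point of this toy objective is (0, 0), and the iterates contract toward it geometrically, which is the linear-rate behavior the abstract refers to for the deterministic algorithm.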

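This paper studies nonconvex minimax problems, which arise in emerging machine learning applications such as generative adversarial networks and adversarial learning. It analyzes AGDA and stochastic AGDA under a two-sided Polyak-Łojasiewicz inequality and develops a variance reduced algorithm for problems with finite-sum structure.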