Several recent works have studied the convergence \textit{in high probability} of stochastic gradient descent (SGD) and its clipped variant. Compared to vanilla SGD, clipped SGD is practically more stable and has the additional theoretical benefit of logarithmic dependence on the failure probability. However, the convergence of other practical nonlinear variants of SGD, e.g., sign SGD, quantized SGD and normalized SGD, that achieve improved communication efficiency or accelerated convergence is much less understood. In this work, we study the convergence bounds \textit{in high probability} of a broad class of nonlinear SGD methods. For strongly convex loss functions with Lipschitz continuous gradients, we prove a logarithmic dependence on the failure probability, even when the noise is heavy-tailed. Strictly more general than the results for clipped SGD, our results hold for any nonlinearity with bounded (component-wise or joint) outputs, such as clipping, normalization, and quantization. Further, existing results with heavy-tailed noise assume bounded $\eta$-th central moments, with $\eta \in (1,2]$. In contrast, our refined analysis works even for $\eta=1$, strictly relaxing the noise moment assumptions in the literature.
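
As an illustrative sketch of the setting, in notation that we assume here for exposition ($x_t$, $\alpha_t$, $z_t$, $\Psi$ and the threshold $M$ are not fixed by the text above), a nonlinear SGD method applies a nonlinearity $\Psi$ to the noisy gradient before taking a step of size $\alpha_t > 0$:
\begin{equation*}
x_{t+1} = x_t - \alpha_t \, \Psi\bigl(\nabla f(x_t) + z_t\bigr),
\end{equation*}
where $z_t$ denotes the stochastic gradient noise. Some standard choices of $\Psi$ with bounded outputs, for a threshold $M > 0$, are
\begin{align*}
\text{sign:} \quad & [\Psi(g)]_i = \operatorname{sign}(g_i), \\
\text{component-wise clipping:} \quad & [\Psi(g)]_i = \min\bigl\{\max\{g_i, -M\},\, M\bigr\}, \\
\text{joint clipping:} \quad & \Psi(g) = \min\Bigl\{1, \frac{M}{\|g\|}\Bigr\}\, g \quad (g \neq 0), \\
\text{normalization:} \quad & \Psi(g) = \frac{g}{\|g\|} \quad (g \neq 0),
\end{align*}
with $\Psi(0) = 0$ in the last two cases. In each case the norm of $\Psi(g)$ is bounded by a constant that does not depend on $g$, so the size of a single update stays controlled no matter how large a realization of $z_t$ is.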

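By studying high-probability convergence bounds for a broad class of nonlinear stochastic gradient descent methods, we prove that, for strongly convex loss functions with Lipschitz continuous gradients, a logarithmic dependence on the failure probability is achievable even when the noise is heavy-tailed. This holds for any nonlinearity with bounded (component-wise or joint) outputs, such as clipping, normalization, and quantization. Compared with previous studies of heavy-tailed noise, our results relax the restriction on the order of the noise moments.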

High-Probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-Tailed Noise