In this paper, we investigate the convergence properties of the stochastic gradient descent (SGD) method and its variants, especially in training neural networks built from nonsmooth activation functions. We develop a novel framework that assigns different timescales to stepsizes for updating the momentum terms and variables, respectively. Under mild conditions, we prove the global convergence of our proposed framework in both single-timescale and two-timescale cases. We show that our proposed framework encompasses a wide range of well-known SGD-type methods, including heavy-ball SGD, SignSGD, Lion, normalized SGD and clipped SGD. Furthermore, when the objective function adopts a finite-sum formulation, we prove the convergence properties for these SGD-type methods based on our proposed framework. In particular, we prove that these SGD-type methods find the Clarke stationary points of the objective function with randomly chosen stepsizes and initial points under mild assumptions. Preliminary numerical experiments demonstrate the high efficiency of our analyzed SGD-type methods.

本研究论文探讨了随机梯度下降（SGD）方法及其变种在训练非光滑激活函数构建的神经网络中的收敛性质，提出了一种新的框架，分别为更新动量项和变量分配不同的时间尺度。在一些温和条件下，我们证明了我们提出的框架在单一时间尺度和双时间尺度情况下的全局收敛性。我们展示了我们提出的框架包含了许多著名的SGD类型方法，包括heavy-ball SGD、SignSGD、Lion、normalized SGD和clipped SGD。此外，当目标函数采用有限和形式时，我们证明了基于我们提出的框架的这些SGD类型方法的收敛性质。特别地，在温和的假设条件下，我们证明了这些SGD类型方法以随机选择的步长和初始点找到了目标函数的Clarke稳定点。初步的数值实验表明了我们分析的SGD类型方法的高效性。

非光滑非凸优化中随机次梯度方法的收敛性保证