This paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Existing results for batch-free nonconvex SMD restrict the choice of the distance generating function (DGF) to be differentiable with Lipschitz continuous gradients, thereby excluding important setups such as Shannon entropy. In this work, we present a new convergence analysis of nonconvex SMD supporting general DGFs, which overcomes the above limitations and relies solely on standard assumptions. Moreover, our convergence is established with respect to the Bregman Forward-Backward envelope, which is a stronger measure than the commonly used squared norm of the gradient mapping. We further extend our results to guarantee high-probability convergence under sub-Gaussian noise and global convergence under the generalized Bregman Proximal Polyak-Łojasiewicz condition. Additionally, we illustrate the advantages of our improved SMD theory in various nonconvex machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of nonconvex differentially private (DP) learning, our theory yields a simple algorithm with a (nearly) dimension-independent utility bound. For the problem of training linear neural networks, we develop provably convergent stochastic algorithms.
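To make the Shannon entropy setup concrete, the following is a minimal sketch of batch-free SMD on the probability simplex with the negative entropy as DGF, where the mirror step has the well-known closed form of a multiplicative (exponentiated gradient) update. The objective (an indefinite quadratic), the noise model, the step size, and all function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def smd_entropy_step(x, grad, step_size):
    """One mirror descent step on the simplex with the negative entropy DGF.

    The Bregman divergence of the negative entropy is the KL divergence,
    and the resulting update is x_new proportional to x * exp(-step_size * grad).
    """
    logits = np.log(x) - step_size * grad
    logits -= logits.max()  # shift for numerical stability; cancels on normalisation
    w = np.exp(logits)
    return w / w.sum()


def stochastic_grad(x, A, rng, noise_std):
    """Noisy gradient of the (assumed) nonconvex objective f(x) = 0.5 * x^T A x.

    A is symmetric and may be indefinite; Gaussian noise stands in for the
    stochastic oracle (batch size one, no averaging).
    """
    return A @ x + noise_std * rng.standard_normal(x.shape)


def run_smd(A, x0, step_size=0.05, n_steps=500, noise_std=0.1, seed=0):
    """Run SMD from an interior point x0 of the simplex and return the last iterate."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        g = stochastic_grad(x, A, rng, noise_std)
        x = smd_entropy_step(x, g, step_size)
    return x


if __name__ == "__main__":
    # Hypothetical symmetric indefinite matrix, chosen only for illustration.
    A = np.array([[1.0, -2.0, 0.5],
                  [-2.0, 1.0, 0.0],
                  [0.5, 0.0, -1.0]])
    x0 = np.full(3, 1.0 / 3.0)
    print(run_smd(A, x0))
```

The entropy DGF is not differentiable with a Lipschitz continuous gradient on the simplex (its gradient blows up at the boundary), which is why this setup falls outside the earlier analyses mentioned above. The sketch only illustrates the update rule and makes no claim about the step-size schedule or convergence measure used in the paper.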

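Summary: The paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Through a new convergence analysis of nonconvex SMD that supports general distance generating functions (DGFs), it removes the restriction in earlier results that the DGF be differentiable with Lipschitz continuous gradients, and it relies only on standard assumptions. Convergence is established with respect to the Bregman Forward-Backward envelope, a stronger measure than the commonly used squared norm of the gradient mapping. The results are further extended to high-probability convergence under sub-Gaussian noise and to global convergence under the generalized Bregman Proximal Polyak-Łojasiewicz condition. By exploiting nonsmooth DGFs, the paper also demonstrates the advantages of the improved SMD theory in several nonconvex machine learning tasks. Notably, for nonconvex differentially private (DP) learning, the theory yields an algorithm with a (nearly) dimension-independent utility bound. For training linear neural networks, the paper develops provably convergent stochastic algorithms.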

Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence