Stochastic gradient descent (SGD) has been deployed to solve highly non-linear and non-convex machine learning problems such as the training of deep neural networks. However, previous work on SGD often relies on highly restrictive and unrealistic assumptions about the nature of noise in SGD. In this work, we mathematically construct examples that defy previous understandings of SGD. For example, our constructions show that: (1) SGD may converge to a local maximum; (2) SGD may escape a saddle point arbitrarily slowly; (3) SGD may prefer sharp minima over flat ones; and (4) AMSGrad may converge to a local maximum. Our results suggest that the noise structure of SGD might be more important than the loss landscape in neural network training and that future research should focus on deriving the actual noise structure in deep learning.

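This paper studies the global optimality of stochastic gradient descent (SGD). After examining the limitations of previous studies, it finds that in some cases SGD can exhibit strange and undesirable behavior. By constructing high-dimensional optimization problems and data distributions, the authors prove that SGD can converge to a local maximum, can take a very long time to escape a saddle point, and can prefer sharp minima over flat ones. The paper also uses a small neural network as an example to verify the theory. The results highlight that understanding the role of SGD in deep learning requires analyzing minibatch sampling, discrete-time updates, and realistic data together.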

SGD with a constant, large learning rate can converge to a local maximum
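
To make this claim concrete, here is a minimal sketch on a one-dimensional toy problem. The problem is an assumption chosen for illustration and is not necessarily the construction used in the paper: the per-sample loss is l(w, x) = x * w^2 / 2, where x is 0.9 or -1.0 with equal probability. Since E[x] = -0.05 < 0, the expected loss is -0.025 * w^2, so w = 0 is a maximum. The SGD update is w <- (1 - lr * x) * w, and |w| shrinks to zero exactly when E[log|1 - lr * x|] < 0. The names `X_VALUES`, `expected_log_factor`, and `log_abs_w_after_sgd` are hypothetical.

```python
import math
import random

# Hypothetical toy problem (an illustrative assumption, not taken from the source):
# per-sample loss l(w, x) = x * w**2 / 2, with x drawn uniformly from X_VALUES.
X_VALUES = (0.9, -1.0)
MEAN_X = sum(X_VALUES) / len(X_VALUES)  # negative, so w = 0 maximizes the expected loss


def expected_log_factor(lr):
    """Average of log|1 - lr * x|: the mean per-step change of log|w| under SGD."""
    return sum(math.log(abs(1.0 - lr * x)) for x in X_VALUES) / len(X_VALUES)


def log_abs_w_after_sgd(lr, steps, seed=0, w0=1.0):
    """Run SGD, w <- w - lr * x * w, and return log|w| after `steps` updates.

    log|w| is tracked directly so that very small or very large |w|
    does not underflow or overflow a float.
    """
    rng = random.Random(seed)
    log_abs_w = math.log(abs(w0))
    for _ in range(steps):
        x = rng.choice(X_VALUES)  # sample one data point (batch size 1)
        log_abs_w += math.log(abs(1.0 - lr * x))
    return log_abs_w


if __name__ == "__main__":
    for lr in (1.0, 0.01):
        drift = expected_log_factor(lr)
        final = log_abs_w_after_sgd(lr, steps=100_000)
        print(f"lr={lr}: mean log-factor={drift:.5f}, final log|w|={final:.1f}")
```

In this sketch the large learning rate (lr = 1.0) gives a negative mean log-factor, so the iterate collapses onto the maximum at w = 0, while the small learning rate (lr = 0.01) gives a positive one and the iterate drifts away from it, as gradient flow on the expected loss would predict.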