Despite an extensive body of literature on deep learning optimization, our current understanding of what makes an optimization algorithm effective is fragmented. In particular, we do not understand well whether enhanced optimization translates to improved generalizability. Current research overlooks the inherent stochastic nature of stochastic gradient descent (SGD) and its variants, resulting in a lack of comprehensive benchmarking and insight into their statistical performance. This paper aims to address this gap by adopting a novel approach. Rather than solely evaluating the endpoint of individual optimization trajectories, we draw from an ensemble of trajectories to estimate the stationary distribution of stochastic optimizers. Our investigation encompasses a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping framework. Through our evaluation, which encompasses synthetic functions with known minima and real-world problems in computer vision and natural language processing, we emphasize fair benchmarking under a statistical framework, comparing stationary distributions and establishing statistical significance. Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD, noise-enabled variants, and novel optimizers utilizing the BH framework. Notably, these algorithms demonstrate performance on par with flat-minima optimizers like SAM, albeit with half the gradient evaluations. We anticipate that our work will catalyze further exploration in deep learning optimization, encouraging a shift away from single-model approaches towards methodologies that acknowledge and leverage the stochastic nature of optimizers.

本文采用一种新方法，通过估计随机优化器的稳态分布，从多条优化轨迹的集合中综合评估，旨在解决当前对深度学习优化算法有效性的理解不完整的问题。通过合成函数和计算机视觉、自然语言处理等领域的实际问题的评估，我们着重在统计框架下进行公平的基准测试和建立统计显著性，揭示了训练损失与保持精确度之间的关系以及SGD、噪声使能变体和利用BH框架的新优化器的可比性能，值得注意的是，这些算法展示了与SAM等平坦最小值优化器相当的性能，但梯度评估减少了一半。我们期待我们的工作将促进深度学习优化的进一步探索，鼓励从单模型方法转向更加认识和利用优化器的随机性质的方法。

深度学习的超出单一模型视图：随机优化算法的优化与泛化能力