The graduated optimization approach is a heuristic method for finding globally optimal solutions of nonconvex functions and has been theoretically analyzed in several studies. This paper defines a new family of nonconvex functions for graduated optimization, discusses their sufficient conditions, and provides a convergence analysis of the graduated optimization algorithm for them. It shows that stochastic gradient descent (SGD) with mini-batch stochastic gradients has the effect of smoothing the function, the degree of which is determined by the learning rate and batch size. This finding provides theoretical insights from a graduated optimization perspective on why training with large batch sizes falls into sharp local minima, why decaying learning rates and increasing batch sizes are superior to fixed learning rates and batch sizes, and what the optimal learning rate scheduling is. To the best of our knowledge, this is the first paper to provide a theoretical explanation for these aspects. Moreover, a new graduated optimization framework that uses a decaying learning rate and increasing batch size is analyzed, and experimental results of image classification that support our theoretical findings are reported.

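This paper defines a new family of nonconvex functions for graduated optimization, discusses sufficient conditions for them, and analyzes the convergence of the graduated optimization algorithm. It finds that stochastic gradient descent (SGD) with mini-batch stochastic gradients smooths the function, with the degree of smoothing determined by the learning rate and batch size. From the perspective of graduated optimization, this finding gives theoretical insight into why large batch sizes fall into sharp local minima and why a decaying learning rate and an increasing batch size outperform a fixed learning rate and batch size, and it yields the optimal learning rate schedule. In addition, a new graduated optimization framework that uses a decaying learning rate and an increasing batch size is analyzed, and image classification experiments that support the theoretical findings are reported.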
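The general idea of graduated optimization can be illustrated with a minimal sketch: minimize a heavily smoothed version of the objective first, then reduce the smoothing step by step, warm-starting each stage from the previous solution. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-dimensional toy objective, the constants `A` and `W`, the schedule `deltas`, and the step size. The smoothing is explicit Gaussian smoothing computed in closed form, standing in for the implicit smoothing that the paper attributes to SGD noise.

```python
import math

# Hypothetical toy objective: f(x) = x^2 + A * (1 - cos(W * x)).
# Its global minimum is at x = 0, with many local minima elsewhere.
A, W = 3.0, 4.0

def f(x):
    return x * x + A * (1.0 - math.cos(W * x))

def smoothed_grad(x, delta):
    """Gradient of E_u[f(x + delta * u)] with u ~ N(0, 1), in closed form.

    Gaussian smoothing damps the cosine term by exp(-(W * delta)^2 / 2)
    and only adds a constant to the quadratic term. delta = 0 recovers
    the gradient of the original f.
    """
    damping = math.exp(-0.5 * (W * delta) ** 2)
    return 2.0 * x + A * W * math.sin(W * x) * damping

def descend(x, delta, lr=0.01, steps=500):
    """Plain gradient descent on the delta-smoothed objective."""
    for _ in range(steps):
        x -= lr * smoothed_grad(x, delta)
    return x

def graduated_descent(x0, deltas=(1.0, 0.5, 0.25, 0.0)):
    """Solve a sequence of progressively less smoothed problems,
    warm-starting each stage from the previous stage's solution."""
    x = x0
    for delta in deltas:
        x = descend(x, delta)
    return x

x_plain = descend(3.0, delta=0.0, steps=2000)  # no smoothing at all
x_grad = graduated_descent(3.0)                # decreasing smoothing
print(f"plain GD:     x = {x_plain:.4f}, f(x) = {f(x_plain):.4f}")
print(f"graduated GD: x = {x_grad:.4f}, f(x) = {f(x_grad):.4f}")
```

In this toy setting, plain gradient descent started at `x = 3.0` stays in the nearby local minimum, while the graduated variant reaches the global minimum at `x = 0`. The decreasing `deltas` schedule plays the role that the abstract assigns to a decaying learning rate and an increasing batch size, although the exact mapping between those hyperparameters and the degree of smoothing is not modeled here.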

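Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling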