Stochastic Gradient Descent (SGD) and its variants are almost universally used to train neural networks and to fit a variety of other parametric models. An important hyperparameter in this context is the batch size, which determines how many samples are processed before an update of the parameters occurs. Previous studies have demonstrated the benefits of using variable batch sizes. In this work, we will theoretically derive optimal batch size schedules for SGD and similar algorithms, up to an error that is quadratic in the learning rate. To achieve this, we approximate the discrete process of parameter updates using a family of stochastic differential equations indexed by the learning rate. To better handle the state-dependent diffusion coefficient, we further expand the solution of this family into a series with respect to the learning rate. Using this setup, we derive a continuous-time optimal batch size schedule for a large family of diffusion coefficients and then apply the results in the setting of linear regression.

我们推导了随机梯度下降和类似算法的最优批次大小计划，通过近似离散参数更新过程为一族随机微分方程，进而使用学习率展开进行优化处理。我们应用这些结果于线性回归任务中。

利用连续时间控制和摄动理论解锁最佳批处理大小日程