In this paper we aim to formally explain the phenomenon of fast convergence
of SGD observed in modern machine learning. The key observation is that most
modern learning architectures are over-parametrized and are trained to
interpolate the data by driving the empirical loss (classification and
regression) close to zero. While it is still unclear why these interpolated
solutions perform well on test data, we show that these regimes allow for fast
convergence of SGD, comparable in number of iterations to full gradient
descent.
For convex loss functions we obtain an exponential convergence bound for
\emph{mini-batch} SGD that parallels the bound for full gradient descent. We
show that there is a critical batch size $m^*$ such that: (a) an SGD iteration
with mini-batch size $m\leq m^*$ is nearly equivalent to $m$ iterations with
mini-batch size $1$ (\emph{linear scaling regime}); (b) an SGD iteration with
mini-batch size $m> m^*$ is nearly equivalent to a full gradient descent
iteration (\emph{saturation regime}).
Moreover, for the quadratic loss, we derive explicit expressions for the
optimal mini-batch and step size and explicitly characterize the two regimes
above. The critical mini-batch size can be viewed as the limit for effective
mini-batch parallelization. It is also nearly independent of the data size,
implying $O(n)$ acceleration over GD per unit of computation. We give
experimental evidence on real data which closely follows our theoretical
analyses.
Finally, we show how our results fit into recent developments in training
deep neural networks and discuss connections to adaptive rates for SGD and
variance reduction.
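
The interpolation regime described in the abstract can be illustrated on a small over-parametrized least-squares problem. The sketch below is not the paper's experiment: the data are synthetic, and the step size $\eta(m) = m/(\beta + (m-1)\lambda_1)$, with $\beta = \max_i \|x_i\|^2$ and $\lambda_1$ the top eigenvalue of $H = X^\top X/n$, is an assumption taken from a standard second-moment bound on the mini-batch update rather than from the abstract, which does not state its expressions. Under this assumed rule, $\eta(m)$ grows almost linearly in $m$ while $(m-1)\lambda_1 \ll \beta$ and saturates at $1/\lambda_1$ for large $m$, which mirrors the two regimes above. The function name `sgd_final_loss` is hypothetical.

```python
import numpy as np

def sgd_final_loss(X, y, batch_size, n_steps, rng):
    """Mini-batch SGD on the quadratic loss L(w) = ||Xw - y||^2 / (2n)."""
    n, d = X.shape
    beta = np.max(np.sum(X ** 2, axis=1))        # max_i ||x_i||^2
    # Nonzero eigenvalues of X^T X / n equal those of X X^T / n (cheaper when n < d).
    lam1 = np.linalg.eigvalsh(X @ X.T / n)[-1]   # top eigenvalue of the Hessian
    # ASSUMPTION: step size from a second-moment bound, not quoted from the paper.
    eta = batch_size / (beta + (batch_size - 1) * lam1)
    w = np.zeros(d)
    for _ in range(n_steps):
        idx = rng.integers(0, n, size=batch_size)   # sample with replacement
        residual = X[idx] @ w - y[idx]
        w -= eta * X[idx].T @ residual / batch_size  # mini-batch gradient step
    return 0.5 * np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
n, d = 20, 100                                    # d > n, so interpolation is possible
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm rows, so beta = 1
y = rng.standard_normal(n)

initial_loss = 0.5 * np.mean(y ** 2)              # loss at w = 0
loss_m1 = sgd_final_loss(X, y, batch_size=1, n_steps=2000, rng=rng)
loss_m4 = sgd_final_loss(X, y, batch_size=4, n_steps=500, rng=rng)
print(initial_loss, loss_m1, loss_m4)
```

Both runs use the same number of sampled gradients (2000), so comparing `loss_m1` with `loss_m4` gives a rough feel for the linear scaling regime: four times fewer iterations at batch size 4 should reach a training loss that is similarly close to zero.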

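This paper aims to formally explain the fast convergence of SGD observed in modern machine learning. The key observation is that modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. We show that these interpolation regimes allow fast convergence of SGD, comparable in number of iterations to full gradient descent. For convex loss functions, we obtain an exponential convergence bound for mini-batch SGD that parallels the bound for full gradient descent. The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization and is nearly independent of the data size.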

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

Background: Statistical mechanics results (Dauphin et al. (2014); Choromanska
et al. (2015)) suggest that local minima with high error are exponentially rare
in high dimensions. However, to prove low-error guarantees for Multilayer
Neural Networks (MNNs), previous works have required either a heavily
modified MNN model or training method, strong assumptions on the labels (e.g.,
"near" linear separability), or an unrealistic hidden layer with
$\Omega\left(N\right)$ units.
Results: We examine an MNN with one hidden layer of piecewise linear units, a
single output, and a quadratic loss. We prove that, with high probability in
the limit of $N\rightarrow\infty$ datapoints, the volume of differentiable
regions of the empirical loss containing sub-optimal differentiable local minima
is exponentially vanishing in comparison with the same volume of global minima,
given standard normal input of dimension
$d_{0}=\tilde{\Omega}\left(\sqrt{N}\right)$, and a more realistic number of
$d_{1}=\tilde{\Omega}\left(N/d_{0}\right)$ hidden units. We demonstrate our
results numerically: for example, $0\%$ binary classification training error on
CIFAR with only $N/d_{0}\approx 16$ hidden neurons.
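
The figure $N/d_{0}\approx 16$ can be checked with a short calculation. The sketch below assumes the full CIFAR-10 training set of 50,000 images and flattened $32\times 32\times 3$ inputs; neither number is stated in the abstract, so both are assumptions of this example.

```python
# ASSUMPTION: CIFAR-10 training set size and flattened input dimension.
N = 50_000                 # number of training datapoints
d0 = 32 * 32 * 3           # input dimension: 3072 pixel values per image

d1 = N / d0                # hidden-unit scale N / d0 from the abstract
first_layer_weights = d0 * d1

print(round(d1, 2))        # 16.28
print(first_layer_weights) # equals N up to floating-point rounding
```

With $d_{1}\approx N/d_{0}$ the first layer holds about $d_{0}d_{1}\approx N$ weights, so under these assumptions the number of first-layer parameters is on the order of the number of datapoints.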

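We prove that, for an MNN with one hidden layer of piecewise linear units, a single output, and a quadratic loss, given standard normal input and a more realistic number of hidden units, the volume of differentiable regions containing sub-optimal differentiable local minima is exponentially vanishing. Numerically, we reach $0\%$ binary classification training error on CIFAR with only about 16 hidden neurons.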