Why does training deep neural networks using stochastic gradient descent (SGD) result in a generalization error that does not worsen with the number of parameters in the network? To answer this question, we advocate a notion of effective model capacity that is dependent on {\em a given random initialization of the network} and not just the training algorithm and the data distribution. We provide empirical evidence demonstrating that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of {\em the $\ell_2$ distance from the initialization}. We also provide theoretical arguments that further highlight the need for initialization-dependent notions of model capacity. We leave as open questions how and why distance from initialization is regularized, and whether it is sufficient to explain generalization.

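This paper studies why training deep neural networks with stochastic gradient descent (SGD) yields a generalization error that does not worsen as the number of network parameters grows, and proposes a notion of effective model capacity that depends on a given random initialization. The authors show empirically that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of the $\ell_2$ distance from the initialization, and they give theoretical arguments that further highlight the need for initialization-dependent notions of model capacity. The paper leaves open how and why the distance from initialization is regularized, and whether it is sufficient to explain generalization.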
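
The quantity at the center of the abstract, the $\ell_2$ distance from the initialization $\|w_t - w_0\|_2$, is straightforward to track during training. The following is a minimal sketch, not the authors' experimental setup: it assumes a toy linear regression problem trained with plain SGD on squared loss, and the names `l2_distance` and `sgd_with_distance_tracking`, the learning rate, and the synthetic data are all hypothetical choices made for illustration.

```python
import math
import random


def l2_distance(params, init_params):
    """Euclidean (l2) distance between two flat parameter vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(params, init_params)))


def sgd_with_distance_tracking(data, init_params, lr=0.01, epochs=20, seed=0):
    """Run plain SGD on squared loss for a linear model (illustrative only).

    Returns the final weights and the distance from initialization
    measured at the end of every epoch.
    """
    rng = random.Random(seed)
    w = list(init_params)      # copy, so the initialization is preserved
    samples = list(data)       # copy, so the caller's list is not shuffled
    distances = []
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            prediction = sum(wi * xi for wi, xi in zip(w, x))
            error = prediction - y
            # Gradient of 0.5 * error**2 with respect to w is error * x.
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
        distances.append(l2_distance(w, init_params))
    return w, distances


# Synthetic toy problem (assumed for illustration, not from the paper).
rng = random.Random(0)
dim = 5
true_w = [rng.gauss(0, 1) for _ in range(dim)]
data = []
for _ in range(50):
    x = [rng.gauss(0, 1) for _ in range(dim)]
    data.append((x, sum(a * b for a, b in zip(true_w, x))))

w0 = [rng.gauss(0, 0.1) for _ in range(dim)]   # random initialization
w, distances = sgd_with_distance_tracking(data, w0)
print("distance from initialization per epoch:", distances)
```

For a deep network the same measurement applies after flattening all weights into one vector; how that distance behaves as the parameter count grows is the empirical question the abstract raises, and this toy example does not answer it.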

Generalization in Deep Networks: The Role of Distance from Initialization