Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes the classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution, however, does not fully characterize the generalization performance. We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization. We then show that, after data separation is achieved, it is possible to dynamically reduce the training set by more than 99% without significant loss of performance. Interestingly, the resulting subset of "high capacity" features varies across different training runs, which is consistent with the theoretical claim that all training points should converge to the same asymptotic margin under SGD and in the presence of both batch normalization and weight decay.
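To make the central quantity concrete, the following is a minimal sketch of how one might compute a margin distribution and the area under its curve from a classifier's outputs. It rests on assumptions not stated in the abstract: the margin of an example is taken to be the true-class score minus the largest competing score, the margins are left unnormalized (any normalization, for instance by weight norms, is omitted), and the curve is the sorted margins plotted against the example index rescaled to [0, 1]. The function names `classification_margins` and `margin_auc` are hypothetical and chosen only for illustration.

```python
import numpy as np


def classification_margins(logits, labels):
    """Per-example margin: true-class score minus the best competing score.

    Assumption: unnormalized multiclass margin; a positive value means the
    example is classified correctly.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(logits.shape[0])
    true_scores = logits[idx, labels]
    # Mask out the true class so the max runs over competing classes only.
    masked = logits.copy()
    masked[idx, labels] = -np.inf
    return true_scores - masked.max(axis=1)


def margin_auc(margins):
    """Area under the sorted-margin curve, with the x-axis rescaled to [0, 1]."""
    m = np.sort(np.asarray(margins, dtype=float))
    if m.size < 2:
        # With fewer than two points there is no interval to integrate over.
        return float(m.sum())
    x = np.linspace(0.0, 1.0, m.size)
    # Trapezoidal rule written out explicitly.
    return float(np.sum((m[1:] + m[:-1]) * np.diff(x)) / 2.0)


if __name__ == "__main__":
    # Toy scores for three examples and three classes (illustrative values only).
    toy_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 1.0], [0.3, 0.1, 3.0]]
    toy_labels = [0, 1, 2]
    toy_margins = classification_margins(toy_logits, toy_labels)
    print(toy_margins, margin_auc(toy_margins))
```

Under these assumptions, a larger area means that more of the training set sits at a large margin. Comparing areas across networks is only meaningful if the margins are put on a common scale first, which this sketch does not attempt.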

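An empirical study of gradient descent on deep neural networks finds that the area under the curve of the margin distribution on the training set is a more accurate way to quantify a model's generalization performance. It also finds that, with batch normalization and weight decay, the training points converge to the same asymptotic margin, although the resulting "high capacity" features are not consistent.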

Distribution of Classification Margins: Are All Data Equal?