Recent advances in deep learning have produced promising results on the
generalization ability of deep neural networks; however, the literature still
lacks a comprehensive theory explaining why heavily over-parametrized models
generalize well while fitting the training data. In this paper we
propose a PAC-type bound on the generalization error of feedforward ReLU
networks by estimating the Rademacher complexity of the set of networks
reachable from an initial parameter vector via gradient descent. The key idea
is to bound the sensitivity of the network's gradient to perturbation of the
input data along the optimization trajectory. The obtained bound does not
explicitly depend on the depth of the network. Our results are experimentally
verified on the MNIST and CIFAR-10 datasets.
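
To make the central quantity concrete, the following is a minimal sketch of a Monte Carlo estimate of the empirical Rademacher complexity of a finite function class on a fixed sample. It is not the bound from the paper: the function name `empirical_rademacher` and the representation of a class as a matrix of function outputs are assumptions made for illustration, and the class of networks reachable by gradient descent is replaced here by an explicit finite list of functions.

```python
import numpy as np


def empirical_rademacher(outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_f (1/n) * sum_i sigma_i * f(x_i) ].

    outputs: array of shape (num_functions, n_samples); row k holds the values
             f_k(x_1), ..., f_k(x_n) of one function on the fixed sample.
             (Hypothetical representation chosen for this sketch.)
    """
    rng = np.random.default_rng(seed)
    outputs = np.asarray(outputs, dtype=float)
    n = outputs.shape[1]
    # Each row of sigma is one draw of n independent +/-1 signs.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # correlations[d, k] = (1/n) * sum_i sigma[d, i] * outputs[k, i]
    correlations = sigma @ outputs.T / n
    # Supremum over the class for each draw, then average over draws.
    return correlations.max(axis=1).mean()


if __name__ == "__main__":
    constant = np.ones((1, 50))                      # a single function
    symmetric = np.vstack([np.ones(50), -np.ones(50)])  # the class {f, -f}
    print(empirical_rademacher(constant))
    print(empirical_rademacher(symmetric))
```

A class containing a single function cannot correlate with random signs on average, so its estimate is close to zero, while a richer class scores higher. This is the sense in which restricting attention to networks reachable from a given initialization can tighten a complexity-based bound.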

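Deep learning has recently produced very promising results, particularly on the generalization ability of deep neural networks, yet the literature still lacks a comprehensive theory explaining why over-parametrized models generalize well while fitting the training data. By estimating the Rademacher complexity of the set of networks obtained from an initial parameter vector via gradient descent, this paper proposes a PAC-type bound on the generalization error of feedforward ReLU networks. The key idea is to bound the sensitivity of the network's gradient to perturbations of the input data along the optimization trajectory. The resulting bound does not explicitly depend on the depth of the network. The results are verified experimentally on the MNIST and CIFAR-10 datasets.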

Optimization-dependent generalization bound for ReLU networks based on tangent-space sensitivity

Optimization dependent generalization bound for ReLU networks based on sensitivity in the tangent bundle

We define a notion of information that an individual sample provides to the
training of a neural network, and we specialize it to measure both how much a
sample informs the final weights and how much it informs the function computed
by the weights. Though related, we show that these quantities behave
qualitatively differently. We give efficient approximations of these
quantities using a linearized network and demonstrate empirically that the
approximation is accurate for real-world architectures, such as pre-trained
ResNets. We apply these measures to several problems, such as dataset
summarization, analysis of under-sampled classes, comparison of informativeness
of different data sources, and detection of adversarial and corrupted examples.
Our work generalizes existing frameworks but enjoys better computational
properties for heavily over-parametrized models, which makes it possible to
apply it to real-world networks.
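
The distinction between informing the weights and informing the computed function can be illustrated with a simple leave-one-out proxy. The sketch below is not the paper's information measure: it assumes a model that is linear in its parameters (a stand-in for a network linearized around fixed weights, with the feature matrix playing the role of the Jacobian), retrains a ridge regression without each sample, and records how far the weights and the test predictions move. The names `ridge_fit`, `leave_one_out_influence`, and `make_demo_data` are hypothetical.

```python
import numpy as np


def ridge_fit(features, targets, reg=1e-3):
    """Closed-form ridge regression weights."""
    d = features.shape[1]
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)


def leave_one_out_influence(features, targets, test_features, reg=1e-3):
    """For each training sample, the squared change caused by removing it.

    Returns two arrays of length n:
      weight_change[i]   = ||w - w_without_i||^2           (weight space)
      function_change[i] = ||X_test (w - w_without_i)||^2  (function space)
    """
    full = ridge_fit(features, targets, reg)
    n = features.shape[0]
    weight_change = np.empty(n)
    function_change = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        reduced = ridge_fit(features[keep], targets[keep], reg)
        delta = full - reduced
        weight_change[i] = delta @ delta
        function_change[i] = np.sum((test_features @ delta) ** 2)
    return weight_change, function_change


def make_demo_data(n=200, d=5, seed=0):
    """Synthetic regression data in which sample 0 has a corrupted label."""
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    targets = features @ true_w + 0.01 * rng.normal(size=n)
    targets[0] += 10.0  # corrupt one label so that it stands out
    test_features = rng.normal(size=(50, d))
    return features, targets, test_features


if __name__ == "__main__":
    X, y, X_test = make_demo_data()
    w_change, f_change = leave_one_out_influence(X, y, X_test)
    print("most influential sample (weights):", int(np.argmax(w_change)))
```

Explicit retraining is only feasible here because the model is tiny. The point of the sketch is the two separate scores per sample, which need not rank samples in the same order, and the fact that a corrupted sample stands out in the weight-space score.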

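The study proposes a definition of information for neural networks that measures how much a sample influences the training of the model and how much it influences the function the model computes. It gives efficient approximations of these quantities using a linearized network and applies them to several problems, including dataset summarization, analysis of under-sampled classes, comparison of the informativeness of different data sources, and detection of adversarial examples.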

Estimating the informational value of samples with smooth unique information

Estimating informativeness of samples with Smooth Unique Information

Large over-parametrized models learned via stochastic gradient descent (SGD)
methods have become a key element in modern machine learning. Although SGD
methods are very effective in practice, most theoretical analyses of SGD
suggest slower convergence than what is empirically observed. In our recent
work [8] we analyzed how interpolation, common in modern over-parametrized
learning, results in exponential convergence of SGD with constant step size for
convex loss functions. In this note, we extend those results to a much broader
non-convex function class satisfying the Polyak-Lojasiewicz (PL) condition. A
number of important non-convex problems in machine learning, including some
classes of neural networks, have been recently shown to satisfy the PL
condition. We argue that the PL condition provides a relevant and attractive
setting for many machine learning problems, particularly in the
over-parametrized regime.
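
The PL condition is commonly stated as (1/2) * ||grad f(w)||^2 >= mu * (f(w) - f*) for some mu > 0, where f* is the minimum value of f. As a minimal sketch of the behavior described above, the following runs constant-step SGD on an over-parametrized least-squares problem with consistent labels, so that interpolation is possible. Least squares is convex and is used here only as the simplest objective that satisfies the PL condition; the problem sizes, the step-size choice, and the function name `sgd_interpolation_demo` are assumptions for illustration, not the setting analyzed in the note.

```python
import numpy as np


def sgd_interpolation_demo(n=20, d=100, steps=3000, seed=0):
    """Constant-step SGD on least squares with more parameters than samples.

    Returns the full training loss recorded every 100 steps.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d)) / np.sqrt(d)  # n < d: over-parametrized
    w_star = rng.normal(size=d)
    y = X @ w_star                            # consistent labels, zero loss is attainable
    w = np.zeros(d)
    # Constant step size: inverse of the largest squared row norm.
    step = 1.0 / np.max(np.sum(X ** 2, axis=1))
    losses = []
    for t in range(steps):
        i = rng.integers(n)                   # sample one training point
        residual = X[i] @ w - y[i]
        w -= step * residual * X[i]           # stochastic gradient step
        if t % 100 == 0:
            losses.append(0.5 * np.mean((X @ w - y) ** 2))
    return np.array(losses)


if __name__ == "__main__":
    for k, loss in enumerate(sgd_interpolation_demo()):
        print(k * 100, loss)
```

Because every sample can be fit exactly, the stochastic gradient vanishes at the solution and no step-size decay is needed. The recorded loss falls by roughly a constant factor per block of iterations, which is the exponential (linear-rate) convergence the abstract refers to.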

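This paper studies the convergence rate of large over-parametrized models trained with stochastic gradient descent, and shows that SGD with a constant step size can achieve exponential convergence when the loss function is convex or belongs to a broad class of non-convex functions satisfying the Polyak-Lojasiewicz condition.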