Overparameterized deep neural networks (DNNs), if not sufficiently
regularized, are susceptible to overfitting their training examples and not
generalizing well to test data. To discourage overfitting, researchers have
developed multicomponent loss functions that reduce intra-class feature
correlation and maximize inter-class feature distance in one or more layers of
the network. By analyzing the penultimate feature layer activations output by a
DNN's feature extraction section prior to the linear classifier, we find that
modified forms of the intra-class feature covariance and inter-class prototype
separation are key components of a fundamental Chebyshev upper bound on the
probability of misclassification, which we designate the Chebyshev Prototype
Risk (CPR). While previous approaches' covariance loss terms scale
quadratically with the number of network features, our CPR bound indicates that
an approximate covariance loss in log-linear time is sufficient to reduce the
bound and is scalable to large architectures. We incorporate the terms of the CPR
bound into our Explicit CPR (exCPR) loss function and observe from empirical
results on multiple datasets and network architectures that our training
algorithm reduces overfitting and improves upon previous approaches in many
settings. Our code is publicly available.
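
The role of the two quantities named in the abstract above can be illustrated with a minimal Python sketch. It is not the CPR bound or the exCPR loss, which use modified forms of these terms. It assumes a plain nearest-prototype rule and the standard multivariate Chebyshev inequality: a sample of class c can only be misclassified if its distance to its own prototype is at least half the distance d_min to the nearest other prototype, so P(error) <= 4 tr(Sigma_c) / d_min^2. The function name `prototype_bound` and its arguments are hypothetical.

```python
import numpy as np

def prototype_bound(features, labels):
    """Illustrative trace-based Chebyshev bound per class (nearest-prototype rule).

    features: (n, d) array of penultimate-layer activations (assumed input).
    labels:   (n,) array of integer class labels.
    """
    classes = np.unique(labels)
    # Class prototypes: mean feature vector of each class.
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    bounds = {}
    for i, c in enumerate(classes):
        x = features[labels == c]
        # Trace of the intra-class covariance = mean squared distance to prototype.
        var_trace = ((x - protos[i]) ** 2).sum(axis=1).mean()
        # Distance from this prototype to the nearest other prototype.
        others = np.delete(protos, i, axis=0)
        d_min = np.linalg.norm(others - protos[i], axis=1).min()
        bounds[int(c)] = min(1.0, 4.0 * var_trace / d_min ** 2)
    return protos, bounds

# Toy data: two tight clusters that are far apart.
feats = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labs = np.array([0, 0, 1, 1])
protos, bounds = prototype_bound(feats, labs)
print(bounds)
```

Shrinking the within-class spread or pushing the prototypes apart both tighten this illustrative bound, which mirrors the two loss components the abstract describes.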

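By analyzing the activations output by the feature extraction layers of a deep neural network, we find that modified forms of the intra-class feature covariance and inter-class prototype separation are key components of a fundamental Chebyshev upper bound on the probability of misclassification, which we call the Chebyshev Prototype Risk (CPR). Our experimental results show that our training algorithm reduces overfitting across multiple datasets and network architectures and improves upon previous approaches.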

Minimizing Chebyshev Prototype Risk Magically Mitigates the Perils of Overfitting

This paper focuses on over-parameterized deep neural networks (DNNs) with
ReLU activation functions and proves that when the data distribution is
well-separated, DNNs can achieve Bayes-optimal test error for classification
while obtaining (nearly) zero training error under the lazy training regime.
For this purpose, we unify three interrelated concepts of overparameterization,
benign overfitting, and the Lipschitz constant of DNNs. Our results indicate
that interpolating with smoother functions leads to better generalization.
Furthermore, we investigate generalization in the special case where DNNs in
the Neural Tangent Kernel (NTK) regime interpolate smooth ground-truth
functions. Our result demonstrates that the generalization error converges to
a constant order that depends only on label noise and initialization noise,
which theoretically verifies benign overfitting. Our analysis provides a tight
lower bound on the normalized margin under non-smooth activation functions, as
well as on the minimum eigenvalue of the NTK in high-dimensional settings,
which is of independent interest in learning theory.
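
As a rough illustration of the NTK quantity mentioned above, the following sketch computes the empirical NTK Gram matrix of a two-layer ReLU network with respect to its first-layer weights, and then its minimum eigenvalue. It assumes fixed output weights in {-1, +1} and 1/sqrt(m) scaling. This simplified setting, the function name `empirical_ntk`, and the random data are assumptions for illustration only; the paper itself analyzes deep networks.

```python
import numpy as np

def empirical_ntk(X, W):
    """NTK Gram matrix of f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x),
    taken with respect to the first-layer weights W only, with a_r in {-1, +1}.

    K[i, j] = (x_i . x_j) * (number of units active on both x_i and x_j) / m
    """
    active = (X @ W.T > 0).astype(float)       # (n, m) ReLU activation pattern
    return (X @ X.T) * (active @ active.T) / W.shape[0]

rng = np.random.default_rng(0)
n, d, m = 8, 16, 2048                          # samples, input dim, hidden width
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
W = rng.standard_normal((m, d))                # weights at random initialization

K = empirical_ntk(X, W)
lambda_min = np.linalg.eigvalsh(K)[0]          # eigvalsh returns ascending order
print(f"minimum NTK eigenvalue: {lambda_min:.4f}")
```

The Gram matrix is an elementwise product of two positive semidefinite matrices, so it is itself positive semidefinite. A minimum eigenvalue bounded away from zero is what lets gradient descent interpolate the training data in this regime.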

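This paper studies and proves that over-parameterized deep neural networks in the lazy training regime can achieve Bayes-optimal test error while obtaining (nearly) zero training error, and it unifies three interrelated concepts: overparameterization, benign overfitting, and the Lipschitz constant of DNNs.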

Benign Overfitting in Deep Neural Networks under Lazy Training

Machine learning systems, especially with overparameterized deep neural
networks, can generalize to novel test instances drawn from the same
distribution as the training data. However, they fare poorly when evaluated on
out-of-support test points. In this work, we tackle the problem of developing
machine learning systems that retain the power of overparameterized function
approximators while enabling extrapolation to out-of-support test points when
possible. This is accomplished by noting that under certain conditions, a
"transductive" reparameterization can convert an out-of-support extrapolation
problem into a problem of within-support combinatorial generalization. We
propose a simple strategy based on bilinear embeddings to enable this type of
combinatorial generalization, thereby addressing the out-of-support
extrapolation problem under certain conditions. We instantiate a simple,
practical algorithm applicable to various supervised learning and imitation
learning tasks.
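
The idea of a transductive reparameterization with a bilinear form can be sketched on a toy 1-D regression problem. The sketch rewrites a query x as an anchor training point x' plus a difference x - x' and predicts with f(x - x')^T M g(x'). The embeddings f and g are fixed by hand and M is fitted by least squares; all three choices, and every function name, are assumptions made for illustration, not the authors' algorithm.

```python
import numpy as np

def pair_features(x, anchor):
    """Flattened outer product f(x - anchor) (x) g(anchor) for hand-picked embeddings."""
    delta = x - anchor
    f = np.stack([delta, np.ones_like(delta)], axis=-1)    # f(delta)  = [delta, 1]
    g = np.stack([anchor, np.ones_like(anchor)], axis=-1)  # g(anchor) = [anchor, 1]
    return np.einsum("ni,nj->nij", f, g).reshape(len(delta), -1)

def fit_bilinear(x_train, y_train):
    """Least-squares fit of M over all (target point, anchor) training pairs."""
    xi, xj = np.meshgrid(x_train, x_train, indexing="ij")
    yi = np.repeat(y_train, len(x_train))                  # label of the target point
    A = pair_features(xi.ravel(), xj.ravel())
    M, *_ = np.linalg.lstsq(A, yi, rcond=None)
    return M

def predict(M, x, anchor):
    return pair_features(np.atleast_1d(x), np.atleast_1d(anchor)) @ M

# Training support is [0, 1]; differences between training points span [-1, 1].
x_train = np.linspace(0.0, 1.0, 5)
y_train = 3.0 * x_train + 1.0
M = fit_bilinear(x_train, y_train)

# x = 1.8 is out of support, but anchor 1.0 and difference 0.8 were both seen.
pred = predict(M, 1.8, 1.0)
print(pred)
```

The query lies outside the training range, yet both the anchor and the difference fall inside ranges seen in training. This is the sense in which extrapolation is converted into generalization over a new combination of familiar components.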

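This paper studies a transductive reparameterization strategy, implemented with bilinear embeddings, that strengthens the combinatorial generalization of deep learning systems under certain conditions and thereby addresses the out-of-support extrapolation problem. The method is practical for a range of supervised learning and imitation learning tasks.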