Adversarial training is a widely used method to improve the robustness of
deep neural networks (DNNs) against adversarial perturbations. However, it is
empirically observed that adversarial training on over-parameterized networks
often suffers from \textit{robust overfitting}: it can achieve almost zero
adversarial training error while its robust generalization performance remains
poor. In this paper, we provide a theoretical understanding of the
question of whether overfitted DNNs in adversarial training can generalize from
an approximation viewpoint. Specifically, our main results are threefold:
i) For classification, we prove by construction the existence of
infinitely many adversarial training classifiers on over-parameterized DNNs
that obtain arbitrarily small adversarial training error (overfitting) while
achieving good robust generalization error under certain conditions concerning
data quality, how well the classes are separated, and the perturbation level. ii) Linear
over-parameterization (meaning that the number of parameters is only slightly
larger than the sample size) is enough to ensure such existence if the target
function is smooth enough. iii) For regression, our results demonstrate that
there also exist infinitely many overfitted DNNs with linear
over-parameterization in adversarial training that can achieve almost optimal
rates of convergence for the standard generalization error. Overall, our
analysis points out that robust overfitting can be avoided but the required
model capacity will depend on the smoothness of the target function, while a
robust generalization gap is inevitable. We hope our analysis will give a
better understanding of the mathematical foundations of robustness in DNNs from
an approximation view.
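
To make the distinction between standard error and adversarial (robust) error concrete, the following is a minimal sketch under stated assumptions, not the construction analyzed in the abstract. It assumes a linear classifier and an $\ell_\infty$ perturbation budget `eps`, for which the worst-case perturbation lowers each margin by exactly `eps * ||w||_1`. The toy data, the weights `w`, `b`, and all function names are hypothetical.

```python
# Toy illustration: standard vs. robust (adversarial) error of a linear
# classifier under an l_inf perturbation of radius eps.
# Assumption: labels are in {-1, +1}; this linear model stands in for a DNN.

def margin(w, b, x, y):
    """Signed margin y * (w . x + b); positive means correctly classified."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def standard_error(w, b, data):
    """Fraction of points misclassified without any perturbation."""
    return sum(margin(w, b, x, y) <= 0 for x, y in data) / len(data)

def robust_error(w, b, data, eps):
    """Fraction of points that some l_inf perturbation of size eps can flip.

    For a linear model the worst case reduces the margin by eps * ||w||_1.
    """
    l1_norm = sum(abs(wi) for wi in w)
    return sum(margin(w, b, x, y) - eps * l1_norm <= 0 for x, y in data) / len(data)

# Hypothetical, well-separated toy data: (features, label).
data = [((2.0, 0.0), 1), ((1.5, 1.0), 1), ((-2.0, 0.5), -1), ((-0.3, -1.0), -1)]
w, b = (1.0, 0.0), 0.0

for eps in (0.1, 0.5):
    print(eps, standard_error(w, b, data), robust_error(w, b, data, eps))
```

In this sketch the standard error stays at zero while the robust error grows once `eps` exceeds the smallest margin, which is one simple way to see why separation and perturbation level appear together in the conditions.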

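Adversarial training is a widely used method for improving the robustness of deep neural networks (DNNs) against adversarial perturbations. However, it is empirically observed that adversarial training of over-parameterized networks often suffers from "robust overfitting": it can achieve near-zero adversarial training error while robust generalization performance remains poor. This paper studies, from an approximation viewpoint, whether overfitted DNNs in adversarial training can generalize, and obtains three main results: i) For classification, we prove by construction that there exist infinitely many adversarial training classifiers on over-parameterized DNNs that attain arbitrarily small adversarial training error (overfitting) while achieving good robust generalization error, provided certain conditions on data quality, separation, and perturbation level hold. ii) Linear over-parameterization (meaning the number of parameters is only slightly larger than the sample size) suffices to ensure such existence if the target function is smooth enough. iii) For regression, our results show that there exist infinitely many overfitted, over-parameterized DNNs in adversarial training that achieve almost optimal rates of convergence for the standard generalization error. Overall, our analysis indicates that robust overfitting can be avoided, but the required model capacity depends on the smoothness of the target function, while a robust generalization gap is inevitable. We hope our analysis gives a better understanding of the mathematical foundations of robustness in DNNs from an approximation viewpoint.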

Can Deep Neural Networks That Overfit in Adversarial Training Generalize? An Approximation Viewpoint

Can overfitted deep neural networks in adversarial training generalize? -- An approximation viewpoint

Deep neural networks have achieved remarkable performance for artificial
intelligence tasks. The success behind intelligent systems often relies on
large-scale models with high computational complexity and storage costs.
Over-parameterized networks are often easy to optimize and can achieve better
performance. However, it is challenging to deploy them on resource-limited
edge devices. Knowledge Distillation (KD) aims to optimize a lightweight
network from the perspective of over-parameterized training. The traditional
offline KD transfers knowledge from a cumbersome teacher to a small and fast
student network. When a sizeable pre-trained teacher network is unavailable,
online KD can improve a group of models by collaborative or mutual learning.
Without needing extra models, Self-KD boosts the network itself using attached
auxiliary architectures. KD mainly involves two aspects: knowledge extraction
and distillation strategies. Beyond KD schemes, various KD
algorithms are widely used in practical applications, such as multi-teacher KD,
cross-modal KD, attention-based KD, data-free KD and adversarial KD. This paper
provides a comprehensive KD survey, including knowledge categories,
distillation schemes and algorithms, as well as some empirical studies on
performance comparison. Finally, we discuss the open challenges of existing KD
works and outline future directions.
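
As a concrete instance of offline, response-based KD, the following is a minimal sketch of the commonly used temperature-softened distillation loss (the formulation popularized by Hinton et al.). The abstract does not prescribe this particular loss; the function names, the temperature `T`, the mixing weight `alpha`, and the example logits are assumptions chosen for illustration.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T gives a softer distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy.

    Assumption: the KL term is scaled by T**2 so that its gradient magnitude
    stays comparable to the cross-entropy term as T changes.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[label])
    return alpha * T * T * kl + (1 - alpha) * ce

# Hypothetical logits for a 3-class problem.
teacher = [5.0, 1.0, -2.0]
student = [2.0, 1.5, 0.0]
print(kd_loss(student, teacher, label=0))
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the hard-label term remains, which is a quick sanity check for any implementation of this loss.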

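This paper provides a comprehensive survey of knowledge distillation, including knowledge categories, distillation schemes and algorithms, and some empirical studies on performance comparison.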

A Taxonomy of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation

Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation

Recent theoretical works based on the neural tangent kernel (NTK) have shed
light on the optimization and generalization of over-parameterized networks,
and have partially bridged the gap between their practical success and classical
learning theory. In particular, using the NTK-based approach, the following three
representative results were obtained: (1) A training error bound was derived to
show that networks can fit any finite training sample perfectly by reflecting a
tighter characterization of training speed depending on the data complexity.
(2) A generalization error bound independent of network size was derived by using
a data-dependent complexity measure (CMD). It follows from this CMD bound that
networks can generalize arbitrary smooth functions. (3) A simple and analytic
kernel function was derived that is indeed equivalent to a fully-trained network.
This kernel outperforms its corresponding network and the existing gold
standard, Random Forests, in few-shot learning. For all of these results to
hold, the network scaling factor $\kappa$ should decrease w.r.t. the sample size $n$.
In this case of decreasing $\kappa$, however, we prove that the aforementioned
results are surprisingly erroneous. This is because the output value of the trained
network decreases to zero when $\kappa$ decreases w.r.t. $n$. To solve this
problem, we tighten key bounds by essentially removing $\kappa$-affected
values. Our tighter analysis resolves the scaling problem and enables the
validation of the original NTK-based results.
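
To give some intuition for why a shrinking scaling factor matters, the following is a minimal sketch that only examines a network at initialization; it does not reproduce the paper's result about the trained network. It assumes a two-layer ReLU network $f(x) = \frac{1}{\sqrt{m}} \sum_r a_r \, \mathrm{ReLU}(w_r^\top x)$ whose first-layer weights are drawn with standard deviation $\kappa$, which is one common NTK-style parameterization and may differ from the exact setup in the works discussed. All function names and sizes are hypothetical.

```python
import math
import random

def two_layer_relu(x, W, a):
    """f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x) for a width-m network."""
    m = len(W)
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(w, x))) for w in W]
    return sum(ar * h for ar, h in zip(a, hidden)) / math.sqrt(m)

def init_params(m, d, kappa, seed=0):
    """Assumed init: first-layer weights ~ N(0, kappa^2), output signs +/-1."""
    rng = random.Random(seed)
    W = [[kappa * rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    return W, a

x = [0.6, -0.8, 0.0]  # hypothetical unit-norm input
base = two_layer_relu(x, *init_params(m=64, d=3, kappa=1.0))
for kappa in (1.0, 0.1, 0.01):
    out = two_layer_relu(x, *init_params(m=64, d=3, kappa=kappa))
    print(kappa, out, out / base if base != 0.0 else float("nan"))
```

Because ReLU is positively homogeneous, the initial output in this sketch scales linearly with $\kappa$, so any bound whose terms carry $\kappa$ needs to be examined carefully when $\kappa$ is tied to $n$.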

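Using the neural tangent kernel (NTK) approach, prior work obtained a training error bound, a generalization error bound independent of network size, and a simple, analytic kernel function that can outperform the corresponding network, but the network scaling factor poses a problem. This paper revises the original approach and proposes tighter error bounds that resolve the scaling problem.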

A Revision of Neural Tangent Kernel Methods for Neural Networks

A Revision of Neural Tangent Kernel-based Approaches for Neural Networks

The hypothesis that sub-network initializations (lottery) exist within the
initializations of over-parameterized networks, which when trained in isolation
produce highly generalizable models, has led to crucial insights into network
initialization and has enabled efficient inference. Supervised models with
uncalibrated confidences tend to be overconfident even when making wrong
predictions. In this paper, for the first time, we study how explicit confidence
calibration in the over-parameterized network impacts the quality of the
resulting lottery tickets. More specifically, we incorporate a suite of
calibration strategies, ranging from mixup regularization and variance-weighted
confidence calibration to the newly proposed likelihood-based calibration and
normalized bin assignment strategies. Furthermore, we explore different
combinations of architectures and datasets, and make a number of key findings
about the role of confidence calibration. Our empirical studies reveal that
including calibration mechanisms consistently leads to more effective lottery
tickets, in terms of accuracy as well as empirical calibration metrics, even
when retrained using data with challenging distribution shifts with respect to
the source dataset.
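
The abstract refers to empirical calibration metrics without naming one. As an illustration, the following is a minimal sketch of the expected calibration error (ECE), a widely used metric of this kind; whether the paper uses exactly this metric is an assumption. The function name, the equal-width binning, and the example values are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins on [0, 1].

    confidences: predicted probability of the chosen class, per sample.
    correct:     1 if the prediction was right, 0 otherwise, per sample.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical overconfident model: 90% confidence but only 50% accuracy.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0]))
```

A lower value means that stated confidence tracks observed accuracy more closely; an overconfident model, as described above, shows a large gap in its high-confidence bins.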

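This paper is the first to study how explicit confidence calibration in over-parameterized networks affects the resulting lottery tickets, and finds that calibration mechanisms yield lottery tickets that are more effective in terms of both accuracy and calibration.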