Overparameterized deep networks have the capacity to memorize training data
with zero \emph{training error}. Even after memorization, the \emph{training
loss} continues to approach zero, which makes the model overconfident and
degrades test performance. Since existing regularizers do not directly aim to avoid
zero training loss, it is hard to tune their hyperparameters in order to
maintain a fixed/preset level of training loss. We propose a direct solution
called \emph{flooding} that intentionally prevents further reduction of the
training loss when it reaches a reasonably small value, which we call the
\emph{flood level}. Our approach makes the loss float around the flood level by
performing mini-batch gradient descent as usual, but switching to gradient ascent
whenever the training loss falls below the flood level. This can be implemented with one line
of code and is compatible with any stochastic optimizer and other regularizers.
With flooding, the model will continue to "random walk" with the same non-zero
training loss, and we expect it to drift into an area with a flat loss
landscape that leads to better generalization. We experimentally show that
flooding improves performance and, as a byproduct, induces a double descent
curve of the test loss.
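
The descent/ascent rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the flooded objective takes the form |J - b| + b, where J is the training loss and b the flood level, which reproduces the stated behavior (descent above b, ascent below b). The toy loss J(w) = w**2, the function names, the learning rate, and the flood level 0.05 are all hypothetical choices made for the example.

```python
def flooded_loss(loss, flood_level):
    """Flooded objective |loss - b| + b (assumed form; b is the flood level)."""
    return abs(loss - flood_level) + flood_level


def train(flood_level=None, steps=200, lr=0.1, w=1.0):
    """Run gradient descent on the toy loss J(w) = w**2 and return the final J.

    With a flood level set, the gradient of |J - b| + b is sign(J - b) * dJ/dw,
    so the update becomes gradient ascent whenever J is below the flood level.
    """
    for _ in range(steps):
        loss = w * w
        grad = 2.0 * w
        if flood_level is not None and loss < flood_level:
            grad = -grad  # below the flood level: ascend instead of descend
        w -= lr * grad
    return w * w


plain = train()                     # ordinary training: loss keeps shrinking
flooded = train(flood_level=0.05)   # flooding: loss hovers near 0.05

print(f"final loss without flooding: {plain:.3e}")
print(f"final loss with flooding:    {flooded:.3e}")
print(f"flooded objective at that point: {flooded_loss(flooded, 0.05):.3e}")
```

In this toy run the unflooded loss decays toward zero, while the flooded loss stays within a narrow band around the flood level, which is the "floating" behavior the abstract describes. In an automatic-differentiation framework the same effect would come from applying `flooded_loss` to the mini-batch loss before backpropagation.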

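This work proposes flooding, a regularizer that keeps the training loss from falling below a small preset value (the flood level) in order to improve generalization, and demonstrates its effectiveness experimentally.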

Do We Need Zero Training Loss After Achieving Zero Training Error?

The ability of overparameterized deep networks to generalize well has been
linked to the fact that stochastic gradient descent (SGD) finds solutions that
lie in flat, wide minima in the training loss -- minima where the output of the
network is resilient to small random noise added to its parameters. So far this
observation has been used to provide generalization guarantees only for neural
networks whose parameters are either \textit{stochastic} or
\textit{compressed}. In this work, we present a general PAC-Bayesian framework
that leverages this observation to provide a bound on the original network
learned -- a network that is deterministic and uncompressed. What enables us to
do this is a key novelty in our approach: our framework allows us to show that
if on training data, the interactions between the weight matrices satisfy
certain conditions that imply a wide training loss minimum, these conditions
themselves {\em generalize} to the interactions between the matrices on test
data, thereby implying a wide test loss minimum. We then apply our general
framework in a setup where we assume that the pre-activation values of the
network are not too small (although we assume this only on the training data).
In this setup, we provide a generalization guarantee for the original
(deterministic, uncompressed) network that does not scale with the product of the
spectral norms of the weight matrices -- a guarantee that would not have been
possible with prior approaches.
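
The notion of a flat, wide minimum as one that is resilient to small random parameter noise can be illustrated with a toy measurement. The sketch below is only an illustration under stated assumptions, not the paper's framework: it perturbs the loss rather than the network output, and the two quadratic losses, the helper name `expected_loss_increase`, the noise scale, and the sample count are all hypothetical.

```python
import random


def expected_loss_increase(loss_fn, w_star, sigma, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E[loss(w* + noise)] - loss(w*),
    with independent Gaussian noise of standard deviation sigma per parameter."""
    rng = random.Random(seed)
    base = loss_fn(w_star)
    total = 0.0
    for _ in range(n_samples):
        noisy = [w + rng.gauss(0.0, sigma) for w in w_star]
        total += loss_fn(noisy) - base
    return total / n_samples


def flat_loss(w):
    """Toy loss with low curvature around its minimum at the origin."""
    return sum(0.1 * x * x for x in w)


def sharp_loss(w):
    """Toy loss with high curvature around the same minimum."""
    return sum(10.0 * x * x for x in w)


w_star = [0.0, 0.0]  # both toy losses are minimized here
flat_gap = expected_loss_increase(flat_loss, w_star, sigma=0.1)
sharp_gap = expected_loss_increase(sharp_loss, w_star, sigma=0.1)

print(f"flat minimum:  mean loss increase {flat_gap:.4f}")
print(f"sharp minimum: mean loss increase {sharp_gap:.4f}")
```

The same noise costs far more at the sharp minimum than at the flat one, which is the sense in which a flat minimum is noise-resilient.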

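This paper presents a PAC-Bayesian framework that exploits the noise resilience of the flat minima found by SGD to bound the generalization error of the original deterministic, uncompressed network, with a guarantee that does not scale with the product of the spectral norms of the weight matrices.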

Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience

Aimed at explaining the surprisingly good generalization behavior of
overparameterized deep networks, recent works have developed a variety of
generalization bounds for deep learning, all based on the fundamental
learning-theoretic technique of uniform convergence. While it is well-known
that many of these existing bounds are numerically large, through numerous
experiments, we bring to light a more concerning aspect of these bounds: in
practice, these bounds can {\em increase} with the training dataset size.
Guided by our observations, we then present examples of overparameterized
linear classifiers and neural networks trained by gradient descent (GD) where
uniform convergence provably cannot "explain generalization" -- even if we take
into account the implicit bias of GD {\em to the fullest extent possible}. More
precisely, even if we consider only the set of classifiers output by GD, which
have test errors less than some small $\epsilon$ in our settings, we show that
applying (two-sided) uniform convergence on this set of classifiers will yield
only a vacuous generalization guarantee larger than $1-\epsilon$. Through these
findings, we cast doubt on the power of uniform convergence-based
generalization bounds to provide a complete picture of why overparameterized
deep networks generalize well.
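
For reference, two-sided uniform convergence can be written in its standard textbook form. The notation below is assumed here rather than taken from the abstract: $\mathcal{H}$ is the hypothesis class, $\mathcal{D}$ the data distribution, $S$ a training set of size $m$, $L_{\mathcal{D}}$ the test error, $\hat{L}_S$ the training error, and $\epsilon_{\mathrm{unif}}$ the resulting bound. With probability at least $1-\delta$ over $S \sim \mathcal{D}^m$:

```latex
\[
  \sup_{h \in \mathcal{H}}
  \bigl| L_{\mathcal{D}}(h) - \hat{L}_S(h) \bigr|
  \;\le\; \epsilon_{\mathrm{unif}}(m, \delta)
\]
```

In these terms, the abstract's claim is that in its settings $\epsilon_{\mathrm{unif}}$ exceeds $1-\epsilon$ even when $\mathcal{H}$ is shrunk to only the classifiers that GD outputs, each of which has test error below $\epsilon$.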

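Experiments show that many existing uniform-convergence-based generalization bounds for deep learning are not only numerically large but can also increase with the training set size. For overparameterized linear classifiers and neural networks trained by GD, two-sided uniform convergence provably cannot explain generalization even when GD's implicit bias is fully taken into account, which casts doubt on the explanatory power of such bounds.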