The Coherent Gradients Hypothesis (CGH) is a recently proposed explanation for why
over-parameterized neural networks trained with gradient descent generalize
well even though they have sufficient capacity to memorize the training set.
The key insight of CGH is that, since the overall gradient for a single step of
SGD is the sum of the per-example gradients, it is strongest in directions that
reduce the loss on multiple examples if such directions exist. In this paper,
we validate CGH on ResNet, Inception, and VGG models on ImageNet. Since the
techniques presented in the original paper do not scale beyond toy models and
datasets, we propose new methods. By posing the problem of suppressing weak
gradient directions as a problem of robust mean estimation, we develop a
coordinate-based median of means approach. We present two versions of this
algorithm: M3, which partitions a mini-batch into 3 groups and computes the
median, and a more efficient version, RM3, which reuses gradients from the
previous two time steps to compute the median. Since they suppress weak gradient
directions without requiring per-example gradients, they can be used to train
models at scale. Experimentally, we find that they indeed greatly reduce
overfitting (and memorization) and thus provide the first convincing evidence
that CGH holds at scale. We also propose a new test of CGH that does not depend
on adding noise to training labels or on suppressing weak gradient directions.
Using the intuition behind CGH, we posit that the examples learned early in the
training process (i.e., "easy" examples) are precisely those that have more in
common with other training examples. Therefore, as per CGH, the easy examples
should generalize better amongst themselves than the hard examples amongst
themselves. We validate this hypothesis with detailed experiments, and believe
that it provides further orthogonal evidence for CGH.
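
The M3 and RM3 ideas described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the names `coordinate_median`, `m3_gradient`, `RM3Combiner`, and the `grad_fn(params, x, y)` callback are hypothetical, and the fallback to the plain gradient during the first two RM3 steps is an assumption made only so the example is complete.

```python
import numpy as np


def coordinate_median(grads):
    """Coordinate-wise median of a list of equally shaped gradient vectors."""
    return np.median(np.stack(grads), axis=0)


def m3_gradient(grad_fn, params, batch_x, batch_y, num_groups=3):
    """M3-style estimate: split the mini-batch into groups, take each group's
    mean gradient, then take the coordinate-wise median of those means.

    grad_fn is a hypothetical callback returning the mean gradient of the
    loss over the examples it is given.
    """
    x_groups = np.array_split(batch_x, num_groups)
    y_groups = np.array_split(batch_y, num_groups)
    group_grads = [grad_fn(params, xg, yg) for xg, yg in zip(x_groups, y_groups)]
    return coordinate_median(group_grads)


class RM3Combiner:
    """RM3-style estimate: median of the current mini-batch gradient and the
    gradients kept from the previous two steps, so each step needs only one
    gradient computation."""

    def __init__(self):
        self.history = []

    def combine(self, grad):
        self.history = (self.history + [grad])[-3:]
        if len(self.history) < 3:
            # Assumption: use the plain gradient until three are available.
            return grad
        return coordinate_median(self.history)
```

For example, with group gradients `[1, 0, 5]`, `[1, 0, 0]`, and `[1, 9, 0]`, the coordinate-wise median is `[1, 0, 0]`: the component shared by all three groups survives, while the components present in only one group are suppressed. Neither routine needs per-example gradients, only gradients of whole groups or mini-batches.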

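This paper validates the Coherent Gradients hypothesis through experiments on ResNet, Inception, and VGG models, and proposes scalable methods for suppressing weak gradient directions, providing the first convincing evidence that the hypothesis explains generalization in contemporary supervised learning.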

Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale

An open question in the Deep Learning community is why neural networks
trained with Gradient Descent generalize well on real datasets even though they
are capable of fitting random data. We propose an approach to answering this
question based on a hypothesis about the dynamics of gradient descent that we
call Coherent Gradients: Gradients from similar examples are similar and so the
overall gradient is stronger in certain directions where these reinforce each
other. Thus changes to the network parameters during training are biased
towards those that (locally) simultaneously benefit many examples when such
similarity exists. We support this hypothesis with heuristic arguments and
perturbative experiments and outline how this can explain several common
empirical observations about Deep Learning. Furthermore, our analysis is not
just descriptive, but prescriptive. It suggests a natural modification to
gradient descent that can greatly reduce overfitting.
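
The reinforcement argument can be made concrete with a toy NumPy sketch. The per-example gradients below are invented for illustration, not taken from any model: every example contributes a unit component along a shared coordinate 0 plus a unit component along a coordinate of its own. The function names are hypothetical.

```python
import numpy as np


def toy_per_example_gradients(num_examples):
    """Hypothetical gradients: one row per example, with a shared component
    in coordinate 0 and an idiosyncratic component in coordinate i + 1."""
    grads = np.zeros((num_examples, num_examples + 1))
    grads[:, 0] = 1.0
    rows = np.arange(num_examples)
    grads[rows, rows + 1] = 1.0
    return grads


def overall_gradient(per_example_grads):
    """The overall gradient is the sum of the per-example gradients."""
    return per_example_grads.sum(axis=0)
```

With four examples, `overall_gradient(toy_per_example_gradients(4))` is `[4, 1, 1, 1, 1]`. The shared direction grows with the number of examples that agree on it, while each idiosyncratic direction stays at the strength of a single example, so a gradient step moves the parameters mostly along the direction that benefits many examples at once.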

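This paper proposes a hypothesis, Coherent Gradients, to explain why neural networks trained with gradient descent generalize well, and supports it with heuristic arguments and simple experiments. The analysis also suggests a natural modification to gradient descent for preventing overfitting.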