Recent works have demonstrated that neural networks exhibit extreme
simplicity bias (SB). That is, they learn only the simplest features to solve the
task at hand, even in the presence of other, more robust but more complex
features. Due to the lack of a general and rigorous definition of features,
these works showcase SB on semi-synthetic datasets such as Color-MNIST and
MNIST-CIFAR, where defining features is relatively easy.
In this work, we rigorously define and thoroughly establish SB for
one-hidden-layer neural networks. More concretely, (i) we define SB as the network
essentially being a function of a low-dimensional projection of the inputs, (ii)
theoretically, we show that when the data is linearly separable, the network
primarily depends on only the linearly separable ($1$-dimensional) subspace
even in the presence of an arbitrarily large number of other, more complex
features which could have led to a significantly more robust classifier, (iii)
empirically, we show that models trained on real datasets such as Imagenette
and Waterbirds-Landbirds indeed depend on a low dimensional projection of the
inputs, thereby demonstrating SB on these datasets, and (iv) finally, we present a
natural ensemble approach that encourages diversity in models by training
successive models on features not used by earlier models, and demonstrate that
it yields models that are significantly more robust to Gaussian noise.
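
One way to make definition (i) concrete is a swap test: keep the component of each input that lies inside a candidate subspace, replace the remaining component with that of another sample, and check whether the prediction changes. The sketch below is only an illustration under stated assumptions. The helper `projection_dependence`, the toy predictors `simple_model` and `complex_model`, and the Gaussian data are hypothetical and are not taken from the paper.

```python
import numpy as np

def projection_dependence(predict, X, P, rng):
    """Fraction of inputs whose prediction is unchanged when the component
    outside the subspace of projector P is swapped with another sample's."""
    perm = rng.permutation(len(X))
    X_in = X @ P                  # component inside the subspace (P is symmetric)
    X_out = X - X_in              # component in the orthogonal complement
    X_mixed = X_in + X_out[perm]  # keep the inside part, swap the outside part
    return np.mean(predict(X) == predict(X_mixed))

rng = np.random.default_rng(0)
d = 10
X = rng.normal(size=(500, d))     # toy Gaussian inputs (assumption)

# Rank-1 orthogonal projector onto the first coordinate axis.
e1 = np.zeros(d)
e1[0] = 1.0
P = np.outer(e1, e1)

# Hypothetical predictors: one uses only the first coordinate, one uses all of them.
simple_model = lambda X: (X[:, 0] > 0).astype(int)
complex_model = lambda X: (X.sum(axis=1) > 0).astype(int)

agree_simple = projection_dependence(simple_model, X, P, rng)
agree_complex = projection_dependence(complex_model, X, P, rng)
print(f"agreement, first-coordinate predictor: {agree_simple:.3f}")
print(f"agreement, all-coordinate predictor:   {agree_complex:.3f}")
```

A predictor that is a function of the projection alone keeps every prediction under the swap, while a predictor that relies on the other coordinates changes many of them. For a trained network, `predict` would be the model's predicted label and `P` a projector estimated from the model, both of which are outside the scope of this sketch.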

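This work rigorously defines and thoroughly investigates the simplicity bias of neural networks. It shows, both theoretically and empirically, that a network solving a task learns features of only a low-dimensional projection of the inputs and does not rely on more complex features. It also proposes an ensemble method that trains successive models on features not used by earlier ones, which makes the resulting models more robust to Gaussian noise.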

Simplicity Bias in 1-Hidden Layer Neural Networks

Can a neural network minimizing cross-entropy learn linearly separable data?
Despite progress in the theory of deep learning, this question remains
unsolved. Here we prove that SGD globally optimizes this learning problem for a
two-layer network with Leaky ReLU activations. The learned network can in
principle be very complex. However, empirical evidence suggests that it often
turns out to be approximately linear. We provide theoretical support for this
phenomenon by proving that if network weights converge to two weight clusters,
this will imply an approximately linear decision boundary. Finally, we show a
condition on the optimization that leads to weight clustering. We provide
empirical results that validate our theoretical analysis.
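
The link between two weight clusters and a linear decision boundary can be checked numerically. The sketch below assumes a two-layer Leaky ReLU network whose hidden units are split into a group with output weight +1 and an equally sized group with output weight -1. That architecture, the function names `leaky_relu` and `network`, and all numeric settings are assumptions made for illustration, not details stated in the abstract. If every unit in the first group has weight vector `u` and every unit in the second has `w`, the output is `k * (leaky_relu(u.x) - leaky_relu(w.x))`, and because Leaky ReLU is strictly increasing its sign equals the sign of `(u - w).x`.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    # Leaky ReLU with slope alpha on the negative side (0 < alpha < 1).
    return np.where(z > 0, z, alpha * z)

def network(X, W_pos, W_neg, alpha=0.1):
    """Two-layer net: rows of W_pos feed output weight +1 and rows of W_neg
    feed output weight -1 (an assumed architecture for this sketch)."""
    pos = leaky_relu(X @ W_pos.T, alpha).sum(axis=1)
    neg = leaky_relu(X @ W_neg.T, alpha).sum(axis=1)
    return pos - neg

rng = np.random.default_rng(0)
d, k = 5, 8                       # input dimension, hidden units per group
u = rng.normal(size=d)            # centre of the positive weight cluster
w = rng.normal(size=d)            # centre of the negative weight cluster
X = rng.normal(size=(1000, d))    # toy Gaussian inputs (assumption)

# Perfectly clustered weights: every unit in a group sits at its cluster centre.
W_pos = np.tile(u, (k, 1))
W_neg = np.tile(w, (k, 1))
linear_sign = np.sign(X @ (u - w))
exact_agreement = np.mean(np.sign(network(X, W_pos, W_neg)) == linear_sign)

# Approximately clustered weights: small perturbations around the centres.
noise = 0.01
W_pos_noisy = W_pos + noise * rng.normal(size=(k, d))
W_neg_noisy = W_neg + noise * rng.normal(size=(k, d))
noisy_agreement = np.mean(
    np.sign(network(X, W_pos_noisy, W_neg_noisy)) == linear_sign
)

print(f"agreement with linear rule, exact clusters: {exact_agreement:.3f}")
print(f"agreement with linear rule, noisy clusters: {noisy_agreement:.3f}")
```

With exact clusters the network and the linear rule `sign((u - w).x)` agree on every sample. With slightly perturbed weights they disagree only near the boundary, which is the sense in which the boundary is approximately linear in this toy setting.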

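This paper proves that a two-layer neural network with Leaky ReLU activations, trained with SGD, can learn linearly separable data while globally minimizing the cross-entropy loss, and that the learned network has a relatively simple, approximately linear decision boundary. It also gives a condition on the optimization under which weight clustering emerges, and validates the theoretical analysis with experiments.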