Can we modify the training data distribution to steer the underlying optimization method toward solutions that generalize better on in-distribution data? In this work, we approach this question for the first time by comparing the inductive bias of gradient descent (GD) with that of sharpness-aware minimization (SAM). By studying a two-layer CNN, we prove that SAM learns easy and difficult features more uniformly, particularly in early epochs. That is, SAM is less susceptible to simplicity bias than GD. Based on this observation, we propose USEFUL, an algorithm that clusters examples based on the network output early in training and upsamples examples with no easy features to alleviate the pitfalls of the simplicity bias. We show empirically that modifying the training data distribution in this way effectively improves the generalization performance on the original data distribution when training with (S)GD by mimicking the training dynamics of SAM. Notably, we demonstrate that our method can be combined with SAM and existing data augmentation strategies to achieve, to the best of our knowledge, state-of-the-art performance for training ResNet18 on CIFAR10, STL10, CINIC10, and Tiny-ImageNet; ResNet34 on CIFAR100; and VGG19 and DenseNet121 on CIFAR10.
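The cluster-then-upsample idea described above can be illustrated with a minimal sketch. The abstract does not give the details, so everything specific here is an assumption made for illustration: clustering is done per class into two groups with a small hand-written k-means, the cluster with the higher mean cross-entropy loss is treated as the one lacking easy features, and its examples are repeated exactly once. The names `two_means` and `upsampled_indices` are hypothetical and not taken from the paper.

```python
import numpy as np

def two_means(x, n_iter=20):
    """Tiny 2-cluster k-means (Lloyd's algorithm); returns a 0/1 label per row."""
    x = np.asarray(x, dtype=float)
    # Deterministic init: the point farthest from the mean, then the point farthest from it.
    c0 = x[np.argmax(np.linalg.norm(x - x.mean(axis=0), axis=1))]
    c1 = x[np.argmax(np.linalg.norm(x - c0, axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                centers[k] = x[assign == k].mean(axis=0)
    return assign

def upsampled_indices(probs, labels):
    """probs: (n, num_classes) softmax outputs saved at an early epoch.
    labels: (n,) integer class labels.
    Returns training-set indices in which the presumed hard cluster of each
    class appears one extra time (assumption: higher mean loss = no easy features)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Per-example cross-entropy loss from the early-epoch outputs.
    losses = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    extra = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        assign = two_means(probs[idx])
        if assign.min() == assign.max():
            continue  # only one cluster found for this class; nothing to upsample
        mean_loss = [losses[idx[assign == k]].mean() for k in range(2)]
        hard = int(np.argmax(mean_loss))
        extra.append(idx[assign == hard])
    extra = np.concatenate(extra) if extra else np.array([], dtype=int)
    return np.concatenate([np.arange(len(labels)), extra])
```

The returned index array could then be used to build the modified training set, for example by passing it to `torch.utils.data.Subset(dataset, indices)` before resuming training with (S)GD.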

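By comparing the inductive bias of gradient descent (GD) with that of sharpness-aware minimization (SAM), we prove that SAM learns easy and difficult features more uniformly in the early phase of training. Based on this, we propose an algorithm that clusters examples by network output and upsamples the examples with no easy features, which improves the generalization performance of (S)GD on the original data distribution. We also show that combining this method with SAM and existing data augmentation strategies achieves state-of-the-art performance when training ResNet18 on CIFAR10, STL10, CINIC10, and Tiny-ImageNet; ResNet34 on CIFAR100; and VGG19 and DenseNet121 on CIFAR10.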

Make the Most of Your Data: Changing the Training Data Distribution to Improve In-Distribution Generalization Performance