More data helps us generalize to a task. But real datasets can contain out-of-distribution (OOD) data; this can come in the form of heterogeneity such as intra-class variability but also in the form of temporal shifts or concept drifts. We demonstrate a counter-intuitive phenomenon for such problems: generalization error of the task can be a non-monotonic function of the number of OOD samples; a small number of OOD samples can improve generalization but if the number of OOD samples is beyond a threshold, then the generalization error can deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this issue using linear classifiers on synthetic datasets and medium-sized neural networks on CIFAR-10.
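The weighted objective mentioned above can be illustrated with a minimal sketch. It assumes that per-sample losses are already computed and that the objective is a convex combination of the mean target loss and the mean OOD loss. The function names and the weight `alpha` are hypothetical, chosen for illustration, and not taken from the paper.

```python
import numpy as np

def weighted_risk(target_losses, ood_losses, alpha):
    """Convex combination of the mean target loss and the mean OOD loss.

    alpha = 1 ignores the OOD samples; alpha = 0 ignores the target samples.
    Both the name `alpha` and this exact form are illustrative assumptions.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    target_losses = np.asarray(target_losses, dtype=float)
    ood_losses = np.asarray(ood_losses, dtype=float)
    # An empty group contributes zero instead of producing NaN.
    target_term = target_losses.mean() if target_losses.size else 0.0
    ood_term = ood_losses.mean() if ood_losses.size else 0.0
    return alpha * target_term + (1.0 - alpha) * ood_term

def pooled_risk(target_losses, ood_losses):
    """Unweighted baseline: every sample counts equally, so the OOD
    samples dominate the objective as their number grows."""
    all_losses = np.concatenate([
        np.asarray(target_losses, dtype=float),
        np.asarray(ood_losses, dtype=float),
    ])
    return all_losses.mean()
```

In the pooled baseline the influence of the OOD samples grows with their count, whereas the weighted form fixes their total influence at `1 - alpha` regardless of how many there are. This form is only usable when the OOD samples can be identified.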

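In this work, we identify a counter-intuitive phenomenon: a small number of out-of-distribution (OOD) samples can improve generalization on the target task, but once the number of OOD samples exceeds a threshold, the generalization error deteriorates. We demonstrate this phenomenon using Fisher's Linear Discriminant on synthetic datasets and deep neural networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet. In the idealized setting where we know which samples are OOD, these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD risks, although its practical utility is limited. When we do not know which samples are OOD, go-to strategies such as data augmentation, hyperparameter optimization and pre-training still cannot guarantee that the target generalization error does not deteriorate as the number of OOD samples grows.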
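A small simulation can make the Fisher's Linear Discriminant setting concrete. The sketch below is an assumed toy setup, not necessarily the configuration used in the paper: the target classes are one-dimensional Gaussians N(-mu, 1) and N(+mu, 1) with equal priors, the OOD classes are the same Gaussians shifted by `delta`, and the classifier thresholds at the midpoint of the pooled class means (the one-dimensional discriminant under known, equal variances). All names and default values are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def normal_cdf(z):
    """Standard normal CDF, applied elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))

def target_error(h, mu):
    """Exact target error of the rule 'predict class 1 if x > h' when the
    target classes are N(-mu, 1) and N(+mu, 1) with equal priors."""
    return 0.5 * (1.0 - normal_cdf(h + mu)) + 0.5 * normal_cdf(h - mu)

def expected_error(n, m, mu=1.0, delta=1.5, trials=2000, seed=0):
    """Monte Carlo estimate of the expected target error when n target and
    m OOD samples per class are pooled to estimate the threshold."""
    rng = np.random.default_rng(seed)
    t0 = rng.normal(-mu, 1.0, size=(trials, n))          # target, class 0
    t1 = rng.normal(+mu, 1.0, size=(trials, n))          # target, class 1
    o0 = rng.normal(-mu + delta, 1.0, size=(trials, m))  # OOD, class 0
    o1 = rng.normal(+mu + delta, 1.0, size=(trials, m))  # OOD, class 1
    # Pooled class means; with m = 0 the OOD sums are simply zero.
    mean0 = (t0.sum(axis=1) + o0.sum(axis=1)) / (n + m)
    mean1 = (t1.sum(axis=1) + o1.sum(axis=1)) / (n + m)
    h = 0.5 * (mean0 + mean1)  # midpoint threshold
    return float(target_error(h, mu).mean())
```

Sweeping `m` for a fixed small `n`, for example `[expected_error(5, m) for m in range(0, 50)]`, gives a curve of expected target error against the number of OOD samples. Whether a dip appears at small `m` depends on the chosen `mu`, `delta` and `n`. For large `m` the threshold is pulled toward `delta`, so the target error rises above its value at `m = 0`.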

The Value of Out-of-Distribution Data