The state of the art in many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them. As a result, the growing computational cost is becoming unaffordable. In this paper, we investigate how to prune large-scale datasets and thus produce an informative subset for training sophisticated deep models with a negligible performance drop. We propose a simple yet effective dataset pruning method that exploits both prediction uncertainty and training dynamics. To our knowledge, this is the first work to study dataset pruning on large-scale datasets, i.e., ImageNet-1K and ImageNet-21K, and advanced models, i.e., Swin Transformer and ConvNeXt. Extensive experimental results indicate that our method outperforms the state of the art and achieves a 75% lossless compression ratio on both ImageNet-1K and ImageNet-21K. The code and pruned datasets are available at https://github.com/BAAI-DCAI/Dataset-Pruning.
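The abstract does not give the scoring formula, so the following is only a minimal sketch of how prediction uncertainty and training dynamics could be combined into a per-sample pruning score. It assumes that uncertainty is measured as the standard deviation of the probability assigned to the true label within a sliding window of epochs, and that the dynamics are captured by averaging that value over all windows. The function names, the `window` parameter, and the `keep_fraction` parameter are hypothetical and are not taken from the released code.

```python
import numpy as np


def dynamic_uncertainty(target_probs: np.ndarray, window: int) -> np.ndarray:
    """Score each sample by its windowed prediction uncertainty.

    target_probs has shape (num_epochs, num_samples): the probability the
    model assigned to each sample's true label at the end of each epoch.
    The windowed standard deviation is an assumed definition of uncertainty.
    """
    num_epochs = target_probs.shape[0]
    if not 1 < window <= num_epochs:
        raise ValueError("window must satisfy 1 < window <= num_epochs")
    # Shape: (num_epochs - window + 1, num_samples, window).
    windows = np.lib.stride_tricks.sliding_window_view(
        target_probs, window, axis=0
    )
    # Uncertainty inside each window, then averaged over the training run.
    per_window_std = windows.std(axis=-1)
    return per_window_std.mean(axis=0)


def select_subset(scores: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Return sorted indices of the highest-scoring samples to keep."""
    num_keep = max(1, int(round(keep_fraction * scores.size)))
    order = np.argsort(-scores, kind="stable")  # descending by score
    return np.sort(order[:num_keep])


# Toy run: 4 epochs, 3 samples (always confident, oscillating, slowly rising).
probs = np.array([
    [0.9, 0.2, 0.5],
    [0.9, 0.8, 0.6],
    [0.9, 0.2, 0.7],
    [0.9, 0.8, 0.8],
])
scores = dynamic_uncertainty(probs, window=2)
kept = select_subset(scores, keep_fraction=2 / 3)
```

In this toy run the sample that is always predicted with the same confidence scores zero and is the first to be pruned, while the oscillating sample scores highest. Whether high-uncertainty samples are in fact the most informative ones is part of the assumption here, not something the abstract states.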

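This paper proposes a dataset pruning method that uses prediction uncertainty and training dynamics to produce an informative subset, so that deep models can be trained on it, in place of the full large-scale dataset, at an affordable computational cost. Experimental results show that the method outperforms the state of the art and achieves a 75% lossless compression ratio on both ImageNet-1K and ImageNet-21K.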

Large-scale Dataset Pruning with Dynamic Uncertainty