Data-efficient learning has drawn significant attention, especially given the current trend of large multi-modal models, where dataset distillation can be an effective solution. However, the dataset distillation process itself is still very inefficient. In this work, we model the distillation problem with reference to information theory. Observing that severe data redundancy exists in dataset distillation, we argue to put more emphasis on the utility of the training samples. We propose a family of methods to exploit the most valuable samples, which is validated by our comprehensive analysis of the optimal data selection. The new strategy significantly reduces the training cost and extends a variety of existing distillation algorithms to larger and more diversified datasets, e.g. in some cases only 0.04% training data is sufficient for comparable distillation performance. Moreover, our strategy consistently enhances the performance, which may open up new analyses on the dynamics of distillation and networks. Our method is able to extend the distillation algorithms to much larger-scale datasets and more heterogeneous datasets, e.g. ImageNet-1K and Kinetics-400. Our code will be made publicly available.

本文提出了一种基于信息理论和样本价值的新的数据集精简方法，经过全面的数据选择分析，该方法能够极大的降低训练成本，扩展现有的精简算法到更大规模、更多元化的数据集上，并且能够在多种不同类型的数据集上持续提高性能。

从大型矿石中提炼金：通过关键样本选择实现高效数据集精馏