We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is H\"older continuous, our approach provably allows selecting a set of ``typical'' $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the H\"older constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performances of leverage score sampling, while being conceptually simpler and more scalable.

我们研究数据选择问题，将利用$k$-means聚类和敏感性抽样方法，基于模型损失的嵌入表示，可选择一组典型样本，其平均损失与整个数据集的平均损失相对应，具有可证明的性质，并且在微调基础模型上表现优于最先进的方法，同时展示了它如何应用于线性回归，提供了一个更简单且可扩展性更强的抽样策略。

基于聚类敏感性采样的数据高效学习：基础模型与扩展