Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO by modifying the objective function, we instead improve DPO through data selection, a largely overlooked but critical aspect. Specifically, we address the parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To estimate margins accurately for data selection, we propose a dual-margin guided approach that considers both external reward margins and implicit DPO reward margins. Extensive experiments demonstrate that our method substantially reduces computational cost while improving performance. Remarkably, using just 10\% of the Ultrafeedback dataset, our approach achieves 3\% to 8\% improvements across various Llama and Mistral series models on the AlpacaEval 2.0 benchmark. Furthermore, our approach extends seamlessly to iterative DPO, yielding a roughly 3\% improvement with 25\% online data while further reducing training time. These results highlight the potential of data selection strategies for advancing preference optimization.
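
To make the dual-margin idea concrete, the following is a minimal sketch of margin-based selection and not the authors' implementation. The external margin is taken as the difference in reward-model scores between the chosen and rejected responses, and the implicit margin uses the standard DPO implicit reward, beta times the policy-to-reference log-probability ratio. The `PreferencePair` record, the function names, the default `beta`, and in particular the way the two margins are fused (averaging their rank-normalized values) are assumptions made for illustration; the abstract does not specify how the margins are combined.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical record holding precomputed scores for one preference pair."""
    prompt: str
    chosen: str
    rejected: str
    ext_reward_chosen: float    # external reward model score for the chosen response
    ext_reward_rejected: float  # external reward model score for the rejected response
    logratio_chosen: float      # log pi(y_w|x) - log pi_ref(y_w|x)
    logratio_rejected: float    # log pi(y_l|x) - log pi_ref(y_l|x)


def external_margin(pair: PreferencePair) -> float:
    """Reward-model margin between the chosen and rejected responses."""
    return pair.ext_reward_chosen - pair.ext_reward_rejected


def implicit_margin(pair: PreferencePair, beta: float = 0.1) -> float:
    """Implicit DPO reward margin: beta times the difference of log-ratios."""
    return beta * (pair.logratio_chosen - pair.logratio_rejected)


def _rank_normalize(values: list[float]) -> list[float]:
    """Map values to [0, 1] by rank so the two margins share a common scale."""
    n = len(values)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = rank / (n - 1)
    return ranks


def select_by_dual_margin(
    pairs: list[PreferencePair], fraction: float = 0.1, beta: float = 0.1
) -> list[PreferencePair]:
    """Keep the top `fraction` of pairs by combined margin.

    ASSUMPTION: the two margins are fused by averaging their rank-normalized
    values; this fusion rule is illustrative, not taken from the paper.
    """
    if not pairs:
        return []
    ext = _rank_normalize([external_margin(p) for p in pairs])
    imp = _rank_normalize([implicit_margin(p, beta) for p in pairs])
    combined = [(e + i) / 2 for e, i in zip(ext, imp)]
    k = max(1, round(fraction * len(pairs)))
    top = sorted(range(len(pairs)), key=lambda i: combined[i], reverse=True)[:k]
    return [pairs[i] for i in top]
```

Rank normalization is used here only because reward-model scores and log-probability ratios live on different scales; any other fusion rule could be substituted in `select_by_dual_margin` without changing the rest of the sketch.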

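This work addresses the problem of data selection in aligning large language models with human preferences. It proposes a novel margin-maximization principle to guide dataset curation, thereby reducing the parameter shrinkage caused by noisy data. Experiments show that, using only 10% of the Ultrafeedback dataset, the method achieves 3% to 8% performance improvements across multiple models while significantly reducing computational cost, demonstrating the potential of data selection for preference optimization.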

Less Is More: Improving Large Language Model Alignment via Preference Data Selection