Aligning language models to human preferences requires data that reveal those preferences. Ideally, time and money would be spent carefully collecting bespoke preference data tailored to each downstream application. In practice, however, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no efforts to measure and compare these datasets. In this paper, we systematically study preference datasets from three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment, providing perspectives that aid training efficiency and iterative data collection for RLHF.

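This study addresses the lack of comparison and measurement among existing preference datasets by proposing a systematic set of evaluation criteria covering three perspectives: scale, label noise, and information content. The results offer preliminary support for data-centric RLHF and aim to improve training efficiency and iterative data collection.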

Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison