Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment
of large language models (LLMs) with human preferences, thereby enhancing the
quality of generated responses. A critical component of RLHF is the reward
model, which is trained on preference data and outputs a scalar reward during
the inference stage. However, the collection of preference data has not yet
been thoroughly investigated. In recent studies, preference data is collected
either by AI or by humans, who identify the chosen and rejected instances among
pairwise responses. We question whether this process effectively filters out
noise and ensures sufficient diversity in the collected data. To address these
concerns, we propose, for the first time, a comprehensive
framework for preference data collection, decomposing the process into four
incremental steps: Prompt Generation, Response Generation, Response Filtering,
and Human Labeling. This structured approach ensures the collection of
high-quality preferences while reducing reliance on human labor. We conduct
comprehensive experiments on the data collected at different stages, and the
results demonstrate the effectiveness of the proposed data collection method.
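
The four steps named above can be pictured as a simple pipeline. The following is only a minimal sketch under stated assumptions, not the method as implemented in this work: the function `collect_preferences`, the `PreferencePair` record, and the four callables passed in are all hypothetical names introduced for illustration, and the concrete generation, filtering, and labeling procedures are left to the caller.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    """One labeled preference: a prompt with a chosen and a rejected response."""
    prompt: str
    chosen: str
    rejected: str


def collect_preferences(
    seed_prompts: List[str],
    generate_prompts: Callable[[List[str]], List[str]],   # hypothetical step 1
    generate_responses: Callable[[str], List[str]],       # hypothetical step 2
    keep_response: Callable[[str, str], bool],            # hypothetical step 3
    label_pair: Callable[[str, str, str], Optional[int]], # hypothetical step 4
) -> List[PreferencePair]:
    """Run the four stages in order and return the labeled pairs."""
    pairs: List[PreferencePair] = []

    # Step 1: Prompt Generation.
    prompts = generate_prompts(seed_prompts)

    for prompt in prompts:
        # Step 2: Response Generation (several candidates per prompt).
        responses = generate_responses(prompt)

        # Step 3: Response Filtering, so that fewer pairs reach the labelers.
        kept = [r for r in responses if keep_response(prompt, r)]

        # Step 4: Human Labeling of each remaining pair.
        # label_pair returns 0 if the first response is preferred,
        # 1 if the second is preferred, or None if no decision is made.
        for first, second in itertools.combinations(kept, 2):
            winner = label_pair(prompt, first, second)
            if winner == 0:
                pairs.append(PreferencePair(prompt, first, second))
            elif winner == 1:
                pairs.append(PreferencePair(prompt, second, first))
            # Undecided pairs are skipped rather than guessed.

    return pairs
```

In this sketch the filtering stage runs before labeling, which is one way the reliance on human labor could be reduced: annotators only see the pairs that survive the filter.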
