The alignment of large language models (LLMs) with human values is crucial
for the development of artificial general intelligence (AGI). One promising
approach to achieving this alignment is reinforcement learning from human
feedback (RLHF), which employs a reward model (RM) learned from human preference
datasets to guide LLMs in generating text that aligns with human preferences.
Through extensive experiments and an analysis of reward distributions, this paper
finds that preference datasets differ from one another, even though they are all
intended to capture human preferences. Hence, mixing diverse human preference
datasets to increase data size for reward modeling may fail to improve it.
To address this issue and capture the shared human values across diverse
preferences, a new training policy called MORE is introduced, which minimizes
preference bias by adaptively adjusting the preference objective across diverse
preferences. Experiments with the Pythia-1.4B model and five mixed preference
datasets show that MORE achieves superior reward accuracy and lower calibration
error, highlighting its ability to leverage diverse human preference data.

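Mixing different human preference datasets to increase data size for reward modeling may fail. The study therefore proposes a new training policy called MORE, which captures the shared human values across different preferences by adaptively adjusting the preference objective. Experiments show that MORE outperforms other methods in reward accuracy and calibration error.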
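
To make the two reported metrics concrete, here is a minimal sketch of how reward accuracy and calibration error could be computed for a pairwise reward model. It is not the MORE method and is not taken from the paper: it assumes a Bradley-Terry style pairwise loss (a common choice for RLHF reward models) and a standard binned expected calibration error, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_loss(r_chosen, r_rejected):
    """Assumed Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    x = -(r_chosen - r_rejected)
    # Stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def reward_accuracy(pairs):
    """Fraction of (chosen, rejected) reward pairs ranked correctly."""
    correct = sum(1 for r_c, r_r in pairs if r_c > r_r)
    return correct / len(pairs)


def expected_calibration_error(pairs, n_bins=10):
    """Binned ECE over the model's confidence in its predicted preference."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum conf, sum correct
    for r_c, r_r in pairs:
        p = sigmoid(r_c - r_r)          # predicted P(chosen is preferred)
        conf = max(p, 1.0 - p)          # confidence in the predicted winner
        hit = 1.0 if r_c > r_r else 0.0  # did the prediction match the label?
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += hit
    total = len(pairs)
    ece = 0.0
    for count, conf_sum, hit_sum in bins:
        if count:
            ece += (count / total) * abs(hit_sum / count - conf_sum / count)
    return ece


if __name__ == "__main__":
    # Illustrative reward scores only, not data from the paper.
    example_pairs = [(2.0, 0.0), (1.0, 1.5), (0.5, -0.5), (3.0, 1.0)]
    print(reward_accuracy(example_pairs))
    print(expected_calibration_error(example_pairs))
```

Each pair holds the scalar rewards the model assigns to the human-preferred and the rejected response. A well-calibrated reward model is one whose confidence in a preference matches how often that preference agrees with the human label.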