Aligning with human preferences and values is an important requirement for building
contemporary foundation models and embodied AI. However, popular approaches
such as reinforcement learning from human feedback (RLHF) break the task down
into successive stages, such as supervised fine-tuning (SFT), reward modeling
(RM), and reinforcement learning (RL), each performing one specific learning
task. This sequential design causes serious issues, notably significant
under-utilization of data and a distribution mismatch between the learned reward
model and the generated policy, which eventually lead to poor alignment
performance. We develop a single-stage approach named Alignment with Integrated
Human Feedback (AIHF), which integrates both human preference and
demonstration data to train the reward model and the policy. The proposed approach
admits a suite of efficient algorithms, which can easily reduce to, and
leverage, popular alignment algorithms such as RLHF and Direct Preference
Optimization (DPO), and requires only minor changes to existing alignment
pipelines. We demonstrate the efficiency of the proposed solutions with
extensive experiments involving alignment problems in LLMs and robotic control
problems in MuJoCo. We observe that the proposed solutions outperform
existing alignment algorithms such as RLHF and DPO by large margins, especially
when the amount of high-quality preference data is relatively limited.

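Aligning with human preferences and values is an important requirement for building contemporary foundation models and embodied AI. This paper proposes a single-stage method named "AIHF (Alignment with Integrated Human Feedback)" that integrates human preferences and demonstrations to train the reward model and the policy. Extensive experiments show that the method outperforms conventional alignment algorithms such as RLHF and DPO on alignment problems in language models and robotic control, especially when the amount of high-quality preference data is relatively limited.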