Reinforcement Learning from Human Feedback (RLHF) is a widely adopted
approach for aligning large language models with human values. However, RLHF
relies on a reward model that is trained with a limited amount of human
preference data, which could lead to inaccurate predictions. As a result, RLHF
may produce outputs that are misaligned with human values. To mitigate this
issue, we contribute a reward ensemble method that allows the reward model to
make more accurate predictions. As using an ensemble of large language
model-based reward models can be computationally and resource-expensive, we
explore efficient ensemble methods including linear-layer ensemble and
LoRA-based ensemble. Empirically, we run Best-of-$n$ and Proximal Policy
Optimization with our ensembled reward models, and verify that our ensemble
methods help improve the alignment performance of RLHF outputs.

采用奖励集成方法，我们研究如何改进 Reinforcement Learning from Human Feedback (RLHF) 模型对人类价值观的对齐效果，通过使用多个大型语言模型的奖励模型集成，提高了 RLHF 输出的对齐性能。