Reinforcement Learning from Human Feedback (RLHF) aligns language models to
human preferences by employing a singular reward model derived from preference
data. However, such an approach overlooks the rich diversity of human
preferences inherent in data collected from multiple users. In this work, we
first derive an impossibility result of alignment with single reward RLHF,
thereby highlighting its insufficiency in representing diverse human
preferences. To provide an equitable solution to the problem, we learn a
mixture of preference distributions via an expectation-maximization algorithm
and propose a MaxMin alignment objective for policy learning inspired by the
Egalitarian principle in social choice theory to better represent diverse human
preferences. We elucidate the connection of our proposed approach to
distributionally robust optimization and general utility RL, thereby
highlighting the generality and robustness of our proposed solution. We present
comprehensive experimental results on small-scale (GPT-2) and large-scale
language models (with Tulu2-7B) and show the efficacy of the proposed approach
in the presence of diversity among human preferences. Our algorithm achieves an
average improvement of more than 16% in win-rates over conventional RLHF
algorithms and improves the win-rate (accuracy) for minority groups by over 33%
without compromising the performance of majority groups, showcasing the
robustness and fairness of our approach. We remark that our findings in this
work are not only limited to language models but also extend to reinforcement
learning in general.

通过使用期望最大化算法，学习一种偏好分布的混合，以及基于社会选择理论中的平等原则提出一种最大最小对齐目标，提高代表多样化人类偏好的能力，并通过小规模和大规模语言模型的实验结果证明其有效性和公平性。