Reinforcement learning with human feedback (RLHF) is an emerging paradigm to
align models with human preferences. Typically, RLHF aggregates preferences
from multiple individuals who have diverse viewpoints that may conflict with
each other. Our work \textit{initiates} the theoretical study of multi-party
RLHF that explicitly models the diverse preferences of multiple individuals. We
show how traditional RLHF approaches can fail since learning a single reward
function cannot capture and balance the preferences of multiple individuals. To
overcome such limitations, we incorporate meta-learning to learn multiple
preferences and adopt different social welfare functions to aggregate the
preferences across multiple parties. We focus on the offline learning setting
and establish sample complexity bounds, along with efficiency and fairness
guarantees, for optimizing diverse social welfare functions such as Nash,
Utilitarian, and Leximin welfare functions. Our results show a separation
between the sample complexities of multi-party RLHF and traditional
single-party RLHF. Furthermore, we consider a reward-free setting, where each
individual's preference is no longer consistent with a reward model, and give
pessimistic variants of the von Neumann Winner based on offline preference
data. Taken together, our work showcases the advantage of multi-party RLHF but
also highlights its more demanding statistical complexity.

多方强化学习与人类反馈是新兴的方法，以使模型符合人类的偏好。本文通过理论研究，探讨了多个个体的多样化偏好的多方强化学习方法，并展示传统方法不适用的情况。文章提出了引入元学习以及采用不同的社会福利函数来聚合多方偏好的方式，其中包括纳什、功利主义和 Leximin 福利函数。同时，文章还考虑了无奖励设置，并给出了基于离线偏好数据的 von Neumann Winner 的悲观变体。研究结果表明，多方强化学习与传统单方强化学习在样本复杂度上存在差异，并凸显了多方强化学习的统计复杂性要求。