While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language
Models (LLMs) with general, aggregate human preferences, it is suboptimal for
learning diverse, individual perspectives. In this work, we study the Reinforcement
Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are
aligned to multiple (sometimes conflicting) preferences by modeling alignment
as a Multi-Objective Reinforcement Learning (MORL) problem. In comparisons
against strong single-objective baselines, we show that personalized alignment
can be achieved by decomposing preferences into multiple dimensions. These
dimensions are defined by the personalizations that the user declares
desirable. We further show that they can be trained efficiently and
independently in a distributed manner and combined effectively post-hoc through
parameter merging.
The code is available at this https URL
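
As a rough illustration of post-hoc parameter merging, the sketch below linearly interpolates the parameters of independently trained policies, one per preference dimension. The abstract does not specify the merging rule, so the weighted average used here is an assumption, and the names merge_parameters, concise, and friendly, along with the toy parameter values, are hypothetical.

```python
from typing import Dict, List

# A policy is represented here as a mapping from parameter name to a flat
# list of floats. This is a toy stand-in for a real model state dict.
Params = Dict[str, List[float]]


def merge_parameters(policies: List[Params], weights: List[float]) -> Params:
    """Merge policies by a weighted average of their parameters.

    Assumption: all policies share the same architecture, so they have the
    same parameter names and shapes. Weights are normalized to sum to 1.
    """
    if len(policies) != len(weights) or not policies:
        raise ValueError("need one weight per policy and at least one policy")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]

    merged: Params = {}
    for name, reference in policies[0].items():
        merged[name] = [
            sum(w * policy[name][i] for w, policy in zip(normalized, policies))
            for i in range(len(reference))
        ]
    return merged


# Hypothetical policies, each trained independently on one preference dimension.
concise = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
friendly = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

# A user who wants both preferences equally gets the midpoint of the two.
merged = merge_parameters([concise, friendly], [0.5, 0.5])
```

Because merging happens after training, a new combination of preferences only requires a new set of weights rather than another training run.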

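By recasting Reinforcement Learning from Human Feedback (RLHF) as Reinforcement Learning from Personalized Human Feedback (RLPHF) and modeling it as a multi-objective reinforcement learning problem, LLMs can be aligned with individual preferences. Preferences are decomposed into separate dimensions, each dimension is trained independently and efficiently in a distributed setting, and the results are combined through parameter merging to achieve multi-dimensional personalized alignment.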