Reinforcement Learning from Human Feedback (RLHF) is the currently dominant
framework for fine-tuning large language models to better align with human
preferences. However, the underlying premise of algorithms developed under this
framework can be problematic when user preferences encoded in human feedback
are diverse. In this work, we aim to address this problem by developing methods
for building personalized language models. We first formally introduce the task
of learning from personalized human feedback and explain why vanilla RLHF can
be problematic in this context. We then propose a general Personalized-RLHF
(P-RLHF) framework, which requires one to jointly learn a user model and a
language (or reward) model. The user model takes in user information and
outputs user representations. Its structure encodes our assumptions about user
preferences underlying the feedback data. We develop new learning objectives
for personalized reward modeling and personalized Direct Preference
Optimization. To demonstrate the efficacy of our method, we test it on
real-world text summarization data with annotated preferences and annotator
information. We fine-tune GPT-J 6B to obtain personalized language (and reward)
models, which outperform non-personalized models in terms of aligning with
individual preferences.
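
As a rough illustration of what jointly using a user model and a preference
objective could look like, the sketch below computes the standard DPO loss for
one preference pair while conditioning the policy score on a user
representation. It is not the objective proposed in this work. The
lookup-table user model, the generic fallback vector for unknown users, and
the dot-product scorer `toy_score` are all hypothetical stand-ins chosen for
brevity, and the feature vectors are made up.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def user_model(user_id, table, generic):
    # Hypothetical user model: maps user information (here just an id)
    # to a user representation, falling back to a generic vector.
    return table.get(user_id, generic)

def toy_score(user_vec, response_feats):
    # Toy stand-in for log pi(y | x, e_u): a dot product between the
    # user representation and made-up response features.
    return sum(u * f for u, f in zip(user_vec, response_feats))

def personalized_dpo_loss(user_vec, chosen, rejected,
                          ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO loss on one (chosen, rejected) pair, with the policy
    # score conditioned on the user representation (assumption).
    margin = ((toy_score(user_vec, chosen) - ref_chosen)
              - (toy_score(user_vec, rejected) - ref_rejected))
    return -log_sigmoid(beta * margin)

# Two hypothetical annotators with opposite tastes.
table = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
generic = [0.5, 0.5]
chosen, rejected = [1.0, 0.0], [0.0, 1.0]

for uid in ("u1", "u2", "unknown"):
    e_u = user_model(uid, table, generic)
    print(uid, personalized_dpo_loss(e_u, chosen, rejected, 0.0, 0.0))
```

In this toy setup the same preference pair yields a lower loss for a user whose
representation agrees with the annotated choice and a higher loss for one whose
representation disagrees, which is the kind of signal a non-personalized model
cannot express.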

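This work develops methods for building personalized language models: it combines the learning objectives of a user model and a language (or reward) model, and applies reinforcement learning to personalized language models so that they better satisfy user preferences.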