Deep Reinforcement Learning is widely used for aligning Large Language Models
(LLMs) with human preferences. However, conventional reward modelling
predominantly depends on human annotations provided by a select cohort of
individuals. Such dependence may unintentionally result in models that are
skewed to reflect the inclinations of these annotators, thereby failing to
represent the expectations of the wider population adequately. In this paper,
we introduce the Distributional Preference Reward Model (DPRM), a simple yet
effective framework to align large language models with a diverse set of human
preferences. To this end, we characterize the preferences by a beta
distribution, which can dynamically adapt to fluctuations in preference trends.
On top of that, we design an optimal-transport-based loss to calibrate
DPRM to align with the preference distribution. Finally, the expected reward is
utilized to fine-tune an LLM policy to generate responses favoured by the
population. Our experiments show that DPRM significantly enhances the alignment
of LLMs with population preferences, yielding more accurate, unbiased, and
contextually appropriate responses.
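
The three ingredients named above (a beta distribution over preferences, an optimal-transport loss, and an expected reward) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the conjugate vote-count update, the five equal-width preference bins, the one-dimensional Wasserstein-1 form of the loss, the per-bin reward values, and all function names are hypothetical.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point x strictly inside (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def update_beta(a, b, n_prefer, n_reject):
    """Assumed conjugate update: new annotator votes shift the Beta parameters."""
    return a + n_prefer, b + n_reject


def discretize_beta(a, b, n_bins=5, steps=2000):
    """Approximate the Beta mass in n_bins equal-width bins (midpoint rule)."""
    width = 1.0 / n_bins
    h = width / steps
    masses = []
    for k in range(n_bins):
        lo = k * width
        area = sum(beta_pdf(lo + (i + 0.5) * h, a, b) for i in range(steps)) * h
        masses.append(area)
    total = sum(masses)
    return [m / total for m in masses]  # renormalize away integration error


def ot_loss_1d(pred, target):
    """Wasserstein-1 distance between two distributions over ordered bins.

    With unit spacing between adjacent bins, the optimal-transport cost in one
    dimension is the sum of absolute differences between the two CDFs.
    """
    cdf_gap, loss = 0.0, 0.0
    for p, q in zip(pred, target):
        cdf_gap += p - q
        loss += abs(cdf_gap)
    return loss


def expected_reward(pred, bin_rewards):
    """Scalar reward: expectation of per-bin rewards under the predicted distribution."""
    return sum(p * r for p, r in zip(pred, bin_rewards))


# Hypothetical example: uniform prior, then 7 annotators prefer and 3 reject.
a, b = update_beta(1.0, 1.0, n_prefer=7, n_reject=3)
target = discretize_beta(a, b)
pred = [0.2] * 5  # an untrained reward model's uniform guess
loss = ot_loss_1d(pred, target)
reward = expected_reward(pred, [-1.0, -0.5, 0.0, 0.5, 1.0])
```

In this sketch, `loss` would be the calibration signal for the reward model and `reward` the scalar passed to the policy optimizer; the actual label set, loss, and reward mapping used by DPRM may differ.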

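The Distributional Preference Reward Model (DPRM) is a simple yet effective framework that aligns large language models (LLMs) with diverse human preferences in order to better represent the preferences of the population.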