Large foundation models pretrained on raw web-scale data are not readily
deployable without an additional step of extensive alignment to human preferences.
Such alignment is typically done by collecting large amounts of pairwise
comparisons from humans ("Do you prefer output A or B?") and learning a reward
model or a policy with the Bradley-Terry-Luce (BTL) model as a proxy for a
human's underlying implicit preferences. These methods generally assume a
single universal preference shared by all humans, and therefore lack the
flexibility to adapt to a plurality of opinions and preferences. In this work,
we propose PAL, a framework to model human preference complementary to existing
pretraining strategies, which incorporates plurality from the ground up. We
propose using the ideal point model as a lens to view alignment using
preference comparisons. Together with our novel reformulation and mixture
modeling, our framework captures the plurality of population preferences while
simultaneously learning a common preference latent space across different
preferences, which generalizes few-shot to new, unseen users. Our approach
enables us to use the penultimate-layer representation of large foundation
models and simple MLP layers to learn reward functions that are on par with
existing large state-of-the-art reward models, thereby significantly improving
the efficiency of reward modeling. We show that PAL achieves competitive reward
model accuracy compared to strong baselines on 1) language models with the
Summary dataset; 2) image generative models with the Pick-a-Pic dataset; 3) a
new semisynthetic heterogeneous dataset generated using Anthropic Personas.
Finally, our experiments also highlight a shortcoming of current preference
datasets, which are created using rigid rubrics that wash away heterogeneity,
and call for more nuanced data collection approaches.

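Large foundation models pretrained on raw web-scale data cannot be deployed directly; they first require extensive alignment to human preferences. This paper proposes PAL, a framework for modeling human preferences that complements existing pretraining strategies and incorporates plurality. Using the ideal point model and mixture modeling, it captures the plurality of population preferences while learning a common preference latent space that generalizes few-shot to new users. The method uses the penultimate-layer representation of foundation models and simple MLP layers to learn reward functions on par with existing large state-of-the-art reward models, greatly improving the efficiency of reward modeling. Experiments show that PAL achieves competitive reward model accuracy against baseline models on multiple datasets, reveal the shortcomings of current preference datasets, and call for more nuanced data collection approaches.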
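
The two preference models named in the abstract can be contrasted in a few lines. The sketch below is illustrative only and is not the PAL implementation: the function names, the squared Euclidean distance, the logistic link on the distance gap, and the reading of "mixture modeling" as a convex combination of shared prototype points are all assumptions made for this example. It shows a BTL-style probability driven by a single shared reward, and an ideal point probability in which each user has their own point in a common latent space and prefers the item closer to it.

```python
import math


def btl_prob(reward_a: float, reward_b: float) -> float:
    """BTL probability that item A is preferred over item B.

    Both rewards come from one shared scalar reward function, so every
    user is assumed to have the same preference.
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def sq_dist(x, y) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def ideal_point_prob(item_a, item_b, user_point) -> float:
    """Probability that a user prefers A over B under an ideal point view.

    The item closer to the user's ideal point is more likely preferred.
    Assumption: negative squared distance plays the role of the reward
    and is passed through the same logistic link as BTL.
    """
    return btl_prob(-sq_dist(item_a, user_point), -sq_dist(item_b, user_point))


def mixture_ideal_point(prototypes, weights):
    """Hypothetical mixture model: a user's ideal point as a weighted
    combination of K prototype points shared across the population."""
    dim = len(prototypes[0])
    return [sum(w * p[i] for w, p in zip(weights, prototypes)) for i in range(dim)]


# Two users with different ideal points disagree about the same item pair.
item_a, item_b = [0.0, 0.0], [1.0, 1.0]
prototypes = [[0.0, 0.0], [1.0, 1.0]]
user_1 = mixture_ideal_point(prototypes, [1.0, 0.0])  # sits on prototype 0
user_2 = mixture_ideal_point(prototypes, [0.0, 1.0])  # sits on prototype 1
p_user_1 = ideal_point_prob(item_a, item_b, user_1)   # above 0.5
p_user_2 = ideal_point_prob(item_a, item_b, user_2)   # below 0.5
```

Under these assumptions, a new user is described only by a small weight vector over the shared prototypes, which is one way to picture how a common latent space could support few-shot adaptation.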