As language models (LMs) become more capable, it is increasingly important to
align them with human preferences. However, the dominant paradigm for training
Preference Models (PMs) for that purpose suffers from fundamental limitations,
such as lack of transparency and scalability, along with susceptibility to
overfitting the preference dataset. We propose Compositional Preference Models
(CPMs), a novel PM framework that decomposes one global preference assessment
into several interpretable features, obtains scalar scores for these features
from a prompted LM, and aggregates these scores using a logistic regression
classifier. CPMs let practitioners control which properties of the preference
data are used to train the preference model and build it on features that are
believed to underlie the human preference judgment. Our experiments show not
only that CPMs generalize better and are more robust to overoptimization than
standard PMs, but also that best-of-n samples obtained using CPMs tend to be
preferred over samples obtained using conventional PMs. Overall, our
approach demonstrates the benefits of endowing PMs with priors about which
features determine human preferences while relying on LM capabilities to
extract those features in a scalable and robust way.

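In short, we propose a new preference model framework, Compositional Preference Models (CPMs), which decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted language model, and aggregates these scores using a logistic regression classifier. Experiments show that CPMs not only improve generalization and are more robust to overoptimization, but also that best-of-n samples obtained using CPMs tend to be preferred over those obtained using conventional preference models.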
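
The aggregation step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the feature names, the hard-coded scores (standing in for what a prompted LM would return), the choice to fit on pairwise score differences without an intercept, and the helper names `fit_logistic`, `preference_score`, and `best_of_n` are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(diffs, labels, lr=0.5, epochs=500):
    """Fit weights w so that sigmoid(w . diff) models P(first response preferred).

    Plain batch gradient descent on the logistic loss, with no intercept
    (an assumption: swapping the two responses should flip the prediction).
    """
    n_features = len(diffs[0])
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for x, y in zip(diffs, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(diffs) for wi, g in zip(w, grad)]
    return w


def preference_score(w, features):
    """Scalar preference score: weighted sum of interpretable feature scores."""
    return sum(wi * fi for wi, fi in zip(w, features))


def best_of_n(w, candidates):
    """Index of the candidate with the highest preference score."""
    return max(range(len(candidates)),
               key=lambda i: preference_score(w, candidates[i]))


# Hypothetical feature scores in [0, 1], e.g. (helpfulness, verbosity).
# In a real pipeline these would come from a prompted LM, one prompt per feature.
# Each entry: (features of response A, features of response B, 1 if A preferred).
pairs = [
    ([0.9, 0.2], [0.3, 0.4], 1),
    ([0.2, 0.5], [0.8, 0.3], 0),
    ([0.7, 0.9], [0.4, 0.1], 1),
    ([0.1, 0.6], [0.6, 0.7], 0),
]
diffs = [[a - b for a, b in zip(fa, fb)] for fa, fb, _ in pairs]
labels = [y for _, _, y in pairs]

weights = fit_logistic(diffs, labels)

# Best-of-n: score n candidate responses and keep the top one.
candidates = [[0.2, 0.5], [0.9, 0.5], [0.5, 0.5]]
best = best_of_n(weights, candidates)
print("weights:", weights)
print("best candidate index:", best)
```

Because the aggregator is linear, the learned weights can be read directly as the contribution of each feature to the preference score, which is the sense in which such a model is interpretable.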