Ensuring AI models align with human values is essential for their safety and
functionality. Reinforcement learning from human feedback (RLHF) uses human
preferences to achieve this alignment. However, preferences sourced from
diverse populations can result in point estimates of human values that may be
sub-optimal or unfair to specific groups. We propose Pareto Optimal Preference
Learning (POPL), which frames discrepant group preferences as objectives with
potential trade-offs, aiming for policies that are Pareto-optimal on the
preference dataset. POPL uses lexicase selection, an iterative procedure that
selects diverse, Pareto-optimal solutions. Our empirical evaluations
demonstrate that POPL surpasses baseline methods in learning sets of reward
functions, effectively catering to distinct groups without access to the number
of groups or to membership labels. Furthermore, we show that POPL can serve as
a foundation for techniques optimizing specific notions of group fairness,
ensuring inclusive and equitable AI model alignment.
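
The iterative selection step mentioned above can be illustrated with a minimal
sketch of generic lexicase selection. This is not the authors' implementation.
It assumes each candidate is a reward-function hypothesis and each "case" is a
single preference comparison, with error 0 when the candidate agrees with the
preference and 1 otherwise. The function name `lexicase_select` and the error
matrix layout are hypothetical, chosen only for illustration.

```python
import random
from typing import Sequence


def lexicase_select(errors: Sequence[Sequence[float]], rng: random.Random) -> int:
    """Pick one candidate index by lexicase selection.

    errors[i][j] is the error of candidate i on case j (lower is better).
    Assumption for this sketch: a case is one preference comparison.
    """
    pool = list(range(len(errors)))          # all candidates start in the pool
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)                       # random case ordering per selection

    for case in cases:
        # Keep only the candidates that are best on the current case.
        best = min(errors[i][case] for i in pool)
        pool = [i for i in pool if errors[i][case] == best]
        if len(pool) == 1:
            break

    # Any remaining ties are broken uniformly at random.
    return rng.choice(pool)


# Toy example: candidates 0 and 1 each satisfy a different preference,
# while candidate 2 satisfies neither and is therefore dominated.
rng = random.Random(0)
errors = [[0, 1], [1, 0], [1, 1]]
picks = [lexicase_select(errors, rng) for _ in range(200)]
```

In this toy example the dominated candidate (index 2) is never selected,
while both specialists (indices 0 and 1) survive across repeated selections,
which is the kind of diversity among non-dominated solutions that the
abstract describes.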

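Using Pareto Optimal Preference Learning (POPL), which relies on lexicase selection, the empirical evaluation in this study shows that POPL surpasses baseline methods in learning reward functions, effectively serves the needs of different groups, and supports inclusive and fair AI model alignment.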