In practice, preference learning from human feedback depends on incomplete
data with hidden context. Hidden context refers to data that affects the
feedback received, but which is not represented in the data used to train a
preference model. This captures common issues of data collection, such as
having human annotators with varied preferences, cognitive processes that
result in seemingly irrational behavior, and combining data labeled according
to different criteria. We prove that standard applications of preference
learning, including reinforcement learning from human feedback (RLHF),
implicitly aggregate over hidden contexts according to a well-known voting rule
called Borda count. We show that this can produce counter-intuitive results that
differ sharply from those of methods that implicitly aggregate via expected
utility. Furthermore, our analysis formalizes the way that preference learning
from users with diverse values tacitly implements a social choice function. A
key implication of this result is that annotators have an incentive to
misreport their preferences in order to influence the learned model, leading to
vulnerabilities in the deployment of RLHF. As a step towards mitigating these
problems, we introduce a class of methods called distributional preference
learning (DPL). DPL methods estimate a distribution of possible score values
for each alternative in order to better account for hidden context.
Experimental results indicate that applying DPL to RLHF for LLM chatbots
identifies hidden context in the data and significantly reduces subsequent
jailbreak vulnerability. Our code and data are available at
this https URL
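
To make the contrast between Borda count and expected-utility aggregation concrete, the following is a minimal illustrative sketch, not the authors' code. The utility values, the three equally likely hidden contexts, the noiseless pairwise comparisons, and the normalization of the Borda score are all assumptions chosen for illustration. Two annotators mildly prefer A, while a third strongly prefers B.

```python
# Hypothetical utilities: each dict is one equally likely hidden context
# (for example, one annotator). These numbers are illustrative assumptions.
utilities = [
    {"A": 1.0, "B": 0.9},  # mildly prefers A
    {"A": 1.0, "B": 0.9},  # mildly prefers A
    {"A": 0.0, "B": 1.0},  # strongly prefers B
]


def expected_utility(utilities):
    """Mean utility of each alternative across hidden contexts."""
    alternatives = list(utilities[0])
    return {
        a: sum(u[a] for u in utilities) / len(utilities)
        for a in alternatives
    }


def borda_count(utilities):
    """Average fraction of contexts in which an alternative beats each rival.

    Assumes noiseless pairwise comparisons; the normalization by the number
    of rivals is one common convention and is an assumption here.
    """
    alternatives = list(utilities[0])
    scores = {}
    for a in alternatives:
        rivals = [b for b in alternatives if b != a]
        win_rates = [
            sum(u[a] > u[b] for u in utilities) / len(utilities)
            for b in rivals
        ]
        scores[a] = sum(win_rates) / len(rivals)
    return scores


def best(scores):
    """Alternative with the highest score."""
    return max(scores, key=scores.get)


eu_winner = best(expected_utility(utilities))   # "B"
borda_winner = best(borda_count(utilities))     # "A"
```

In this toy setting A wins two of three pairwise comparisons and so has the higher Borda score, even though B has the higher mean utility. A rule that only counts wins ignores how strongly each context prefers an alternative, which is the kind of divergence the abstract describes.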

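By analyzing preference data in learning from human feedback, we find that hidden context can lead to counter-intuitive results and thereby introduce vulnerabilities in reinforcement learning algorithms. To mitigate these problems, we introduce a method called distributional preference learning, which better accounts for hidden context and significantly reduces subsequent vulnerability to attacks.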