Reinforcement Learning from Human Feedback (RLHF) can be used to capture
complex and nuanced properties of text generation quality. As a result, the
task of text summarization has been identified as a good candidate for this
process. In this paper, we explore how preference agreement impacts the
efficacy of RLHF for summarization. We show that sampling human preferences to
include a range of annotator agreement (1) yields more accurate reward
models and (2) alters the characteristics of quality captured. We additionally
show improvements in downstream generation when using a reward model trained
with a range of preference agreements. Our findings have implications for
the design of synthetic datasets and underscore the importance of considering
quality differentials in comparison-based data.
