A key requirement in developing Generative Language Models (GLMs) is to have
their values aligned with human values. Preference-based alignment is a widely
used paradigm for this purpose, in which preferences over generation pairs are
first elicited from human annotators or AI systems and then fed into an
alignment technique, e.g., Direct Preference Optimization. However, a
substantial percentage (20-40%) of the preference pairs used in GLM alignment
are noisy, and it remains unclear how the noise affects the alignment
performance and how to mitigate its negative impact. In this paper, we propose
a framework to inject desired amounts and types of noise into the preferences,
and systematically study the impact of preference noise on the alignment
performance in two tasks (summarization and dialogue generation). We find that
the alignment performance can be highly sensitive to the noise rates in the
preference data: e.g., a 10 percentage point (pp) increase in the noise rate
can lead to a 30 pp drop in the alignment performance (in win rate). To mitigate
the impact of noise, confidence-based data filtering shows significant benefit
when certain types of noise are present. We hope our work can help the
community better understand and mitigate the impact of preference noise in GLM
alignment.

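This paper proposes a framework for injecting different types and amounts of noise into preferences, and systematically studies the impact of preference noise on alignment performance in two tasks (summarization and dialogue generation). We find that an increase in the noise rate of the preference data leads to a significant drop in alignment performance, and propose confidence-based data filtering to reduce the impact of noise. We hope this work can help the community better understand and mitigate the impact of preference noise on Generative Language Models.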