Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical to the safety and trustworthiness of these systems. A key component of value alignment is modeling human preferences as a representation of human values. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect a model's predictions for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivity to minor changes in the preferences they model. Our findings reveal that, in the Bradley-Terry and Plackett-Luce models, the probability of a preference can change significantly as other preferences change, especially when these preferences are dominant (i.e., have probabilities near 0 or 1). We identify the specific conditions under which this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems.
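
As a rough illustration of the sensitivity described above, the following minimal sketch uses only the standard Bradley-Terry form, in which each option has a positive score and the odds of preferring i over k equal the odds of i over j times the odds of j over k. It is not the analysis carried out in the paper; the function names `implied_preference` and `shift`, the step size of 0.01, and the example probabilities are assumptions chosen for illustration.

```python
def implied_preference(p_ij: float, p_jk: float) -> float:
    """Bradley-Terry probability that i is preferred over k.

    Follows from p_ab = s_a / (s_a + s_b): the odds of i over k are the
    product of the odds of i over j and the odds of j over k.
    """
    numerator = p_ij * p_jk
    return numerator / (numerator + (1.0 - p_ij) * (1.0 - p_jk))


def shift(p_ij: float, p_jk: float, delta: float = 0.01) -> float:
    """Absolute change in the implied p_ik when p_ij decreases by delta.

    The step size delta is an illustrative assumption.
    """
    before = implied_preference(p_ij, p_jk)
    after = implied_preference(p_ij - delta, p_jk)
    return abs(after - before)


# Moderate preferences: both probabilities are far from 0 and 1.
moderate = shift(0.60, 0.40)

# Dominant preferences: one probability near 1, the other near 0.
dominant = shift(0.99, 0.01)

print(f"shift with moderate preferences: {moderate:.4f}")
print(f"shift with dominant preferences: {dominant:.4f}")
```

In both settings the implied probability of preferring i over k starts at 0.5, yet the same 0.01 perturbation moves it more than ten times as far in the dominant case as in the moderate one.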

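This paper studies the robustness of value alignment and shows that preference models are sensitive to changes in preferences. We find that, in the Bradley-Terry and Plackett-Luce models, the probability of some preferences can change significantly as other preferences change, especially when preferences are dominant. This finding has important implications for the robustness and safety of value alignment in AI systems.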

Strong Preferences Affect the Robustness of Value Alignment