Existing AI alignment approaches assume that preferences are static, which is
unrealistic: our preferences change, and may even be influenced by our
interactions with AI systems themselves. To clarify the consequences of
incorrectly assuming static preferences, we introduce Dynamic Reward Markov
Decision Processes (DR-MDPs), which explicitly model preference changes and the
AI's influence on them. We show that despite its convenience, the
static-preference assumption may undermine the soundness of existing alignment
techniques, leading them to implicitly reward AI systems for influencing user
preferences in ways users may not truly want. We then explore potential
solutions. First, we offer a unifying perspective on how an agent's
optimization horizon may partially help reduce undesirable AI influence. Then,
we formalize different notions of AI alignment that account for preference
change from the outset. Comparing the strengths and limitations of 8 such
notions of alignment, we find that they all either err towards causing
undesirable AI influence, or are overly risk-averse, suggesting that a
straightforward solution to the problems of changing preferences may not exist.
Because changing preferences are unavoidable in real-world settings, it is all
the more important to handle these issues with care, balancing risks and
capabilities. We hope our work can provide conceptual clarity and constitute a
first step towards AI alignment practices that explicitly account for (and
contend with) the changing and influenceable nature of human preferences.

AI Alignment with Changing and Influenceable Reward Functions

As artificial intelligence becomes more powerful and more ubiquitous in daily
life, it is imperative to understand and manage the impact of AI systems on our
lives and decisions. Modern ML systems often change user behavior (e.g.,
personalized recommender systems learn user preferences to deliver
recommendations that change online behavior). An externality of behavior change
is preference change. This article argues for the establishment of a
multidisciplinary endeavor focused on understanding how AI systems change
preferences: Preference Science. We operationalize preference to incorporate
concepts from various disciplines, outlining the importance of meta-preferences
and preference-change preferences, and proposing a preliminary framework for
how preferences change. We draw a distinction between preference change,
permissible preference change, and outright preference manipulation. Diverse
disciplines contribute unique insights to this framework.

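In short: this article proposes a multidisciplinary effort focused on understanding how AI systems affect individuals' decision preferences. It operationalizes preference using concepts from several disciplines, proposes a framework for preference change, and distinguishes acceptable from unacceptable change.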