Power-seeking behavior is a key source of risk from advanced AI, but our theoretical understanding of this phenomenon is relatively limited. Building on existing theoretical results demonstrating power-seeking incentives for most reward functions, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some simplifying assumptions. We formally define the training-compatible goal set (the set of goals consistent with the training rewards) and assume that the trained agent learns a goal from this set. In a setting where the trained agent faces a choice to shut down or avoid shutdown in a new situation, we prove that the agent is likely to avoid shutdown. Thus, we show that power-seeking incentives can be probable (likely to arise for trained agents) and predictive (allowing us to predict undesirable behavior in new situations).

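Power-seeking behavior in advanced AI is an important source of risk, but theoretical understanding of this phenomenon remains relatively limited. Building on existing theoretical results, this paper studies how the training process affects power-seeking incentives and shows that, under some simplifying assumptions, these incentives are still likely to arise in trained agents and can be used to predict undesirable behavior in new situations.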

Power-seeking behavior in trained agents can be predicted