We investigate the question: if an AI agent is known to be safe in one
setting, is it also safe in a new setting similar to the first? This is a core
question of AI alignment--we train and test models in a certain environment,
but deploy them in another, and we need to guarantee that models that seem safe
in testing remain so in deployment. Our notion of safety is based on
power-seeking--an agent which seeks power is not safe. In particular, we focus
on a crucial type of power-seeking: resisting shutdown. We model agents as
policies for Markov decision processes, and show (in two cases of interest)
that not resisting shutdown is "stable": if an MDP has certain policies which
don't avoid shutdown, the corresponding policies for a similar MDP also don't
avoid shutdown. We also show that there are natural cases where safety is _not_
stable--arbitrarily small perturbations may result in policies which never shut
down. In our first case of interest--near-optimal policies--we use a
bisimulation metric on MDPs to prove that small perturbations won't make the
agent take longer to shut down. Our second case of interest is policies for
MDPs satisfying certain constraints which hold for various models (including
language models). Here, we demonstrate a quantitative bound on how fast the
probability of not shutting down can increase: by defining a metric on MDPs;
proving that the probability of not shutting down, as a function on MDPs, is
lower semicontinuous; and bounding how quickly this function decreases.
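
As a toy illustration of the quantity discussed above (not the paper's own
construction), the sketch below computes the probability that a fixed policy
has not reached a shutdown state within a given number of steps, for a
two-state MDP and for a slightly perturbed copy of it. The function names, the
toy MDP, the entrywise perturbation `eps`, and the finite horizon are all
assumptions made for illustration; the abstract's results use a bisimulation
metric and a separately defined metric on MDPs, neither of which is reproduced
here.

```python
def prob_not_shut_down(P, policy, shutdown, start, horizon):
    """Probability of NOT having entered `shutdown` within `horizon` steps.

    P[a][s][t] : probability of moving from state s to state t under action a
    policy[s]  : deterministic action taken in state s
    """
    n = len(P[0])
    # Markov chain induced by the policy: row s comes from the chosen action.
    M = [list(P[policy[s]][s]) for s in range(n)]
    # Treat shutdown as absorbing so probability mass that reaches it stays.
    M[shutdown] = [1.0 if t == shutdown else 0.0 for t in range(n)]
    dist = [1.0 if s == start else 0.0 for s in range(n)]
    for _ in range(horizon):
        dist = [sum(dist[s] * M[s][t] for s in range(n)) for t in range(n)]
    return 1.0 - dist[shutdown]


def toy_mdp(eps=0.0):
    """Hypothetical MDP: state 0 = running, state 1 = shut down.

    Action 0 ("comply") shuts down with probability 0.9 - eps.
    Action 1 ("resist") never shuts down.
    """
    comply = [[0.1 + eps, 0.9 - eps], [0.0, 1.0]]
    resist = [[1.0, 0.0], [0.0, 1.0]]
    return [comply, resist]


complying_policy = [0, 0]  # always choose "comply"
p_base = prob_not_shut_down(toy_mdp(), complying_policy, 1, 0, 3)
p_pert = prob_not_shut_down(toy_mdp(eps=0.01), complying_policy, 1, 0, 3)
print(p_base, p_pert)
```

In this toy case a small change in the transition probabilities produces a
small increase in the probability of not shutting down, which is the kind of
behavior a stability bound would quantify. A policy that always picks "resist"
stays in the running state forever, regardless of `eps`.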

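We ask whether an AI agent that is known to be safe in one setting is also safe
in a similar new setting. This is a core question of AI alignment: models are
trained and tested in one environment but deployed in another, so models that
appear safe in testing must remain safe in deployment. Safety here is defined
in terms of power-seeking: an agent that seeks power is not safe. Modeling
agents as policies for Markov decision processes (MDPs), we study one crucial
type of power-seeking, resisting shutdown. We show that in some cases safety is
not stable: small perturbations can produce agents that never shut down. For
near-optimal policies, we use a bisimulation metric on MDPs to prove that small
perturbations do not make the agent take longer to shut down. For policies on
MDPs satisfying certain constraints that hold for various models, including
language models, we give a quantitative bound on how fast the probability of
not shutting down can increase, by defining a metric on MDPs, proving that the
probability of not shutting down is lower semicontinuous as a function on MDPs,
and bounding how quickly this function decreases.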

Quantifying stability of non-power-seeking in artificial agents

Rapid advancements in artificial intelligence (AI) have sparked growing
concerns among experts, policymakers, and world leaders regarding the potential
for increasingly advanced AI systems to pose existential risks. This paper
reviews the evidence for existential risks from AI via misalignment, where AI
systems develop goals misaligned with human values, and power-seeking, where
misaligned AIs actively seek power. The review examines empirical findings,
conceptual arguments and expert opinion relating to specification gaming, goal
misgeneralization, and power-seeking. The current state of the evidence is
found to be concerning but inconclusive regarding the existence of extreme
forms of misaligned power-seeking. Strong empirical evidence of specification
gaming combined with strong conceptual evidence for power-seeking makes it
difficult to dismiss the possibility of existential risk from misaligned
power-seeking. On the other hand, to date there are no public empirical
examples of misaligned power-seeking in AI systems, and so arguments that
future systems will pose an existential risk remain somewhat speculative. Given
the current state of the evidence, it is hard to be extremely confident either
that misaligned power-seeking poses a large existential risk, or that it poses
no existential risk. The fact that we cannot confidently rule out existential
risk from AI via misaligned power-seeking is cause for serious concern.

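Rapid progress in artificial intelligence has raised concern among experts,
policymakers, and world leaders about the existential risks that increasingly
advanced AI systems may pose. This paper reviews the evidence for existential
risk from AI by examining specification gaming, goal misgeneralization, and
power-seeking. It finds the current state of the evidence concerning but
inconclusive regarding the existence of extreme forms of misaligned
power-seeking. Strong empirical evidence of specification gaming, combined with
strong conceptual arguments for power-seeking, makes it difficult to rule out
existential risk from misaligned power-seeking. On the other hand, there are to
date no public empirical examples of misaligned power-seeking in AI systems, so
arguments that future systems will pose an existential risk remain somewhat
speculative. Given the current evidence, it is hard to be very confident either
that misaligned power-seeking poses a large existential risk or that it poses
none. That existential risk from AI via misaligned power-seeking cannot be
confidently ruled out is cause for serious concern.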