We present a general framework for training safe agents whose naive
incentives are unsafe. As an example, manipulative or deceptive behaviour can
improve rewards but should be avoided. Most approaches fail here: agents
maximize expected return by any means necessary. We formally describe settings
with 'delicate' parts of the state which should not be used as a means to an
end. We then train agents to maximize the causal effect of actions on the
expected return that is not mediated by the delicate parts of the state, using
Causal Influence Diagram analysis. The resulting agents have no incentive to
control the delicate state. We further show how our framework unifies and
generalizes existing proposals.

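We present a general framework for training safe agents whose naive incentives are unsafe. As an example, we discuss cases where manipulative or deceptive behaviour can improve rewards but should be avoided. We formally describe settings with 'delicate' parts of the state, which should not be used as a means to an end. Using Causal Influence Diagram analysis, we train agents to maximize the causal effect of actions on the expected return that is not mediated by the delicate state. Using this framework, we further show how to unify and generalize existing proposals.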
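
As a rough illustration of the objective described in the abstract, the following toy sketch contrasts an agent that maximizes ordinary expected return with one that ignores the part of the return mediated by a delicate state. Everything in it is an assumption made for illustration and is not taken from the paper: the two actions, the probabilities in `P_Z1`, the reward values, and the use of a fixed baseline action to decide how the delicate state is distributed when its dependence on the agent's action is cut.

```python
# Toy model (hypothetical): action A -> delicate state Z -> return R, and A -> R directly.
ACTIONS = ["honest", "manipulate"]
BASELINE_ACTION = "honest"  # assumed reference action for the delicate state

# Assumed probability that the delicate state Z equals 1 given the action.
P_Z1 = {"honest": 0.3, "manipulate": 0.9}


def reward(action, z):
    """Assumed reward: a direct term from the action plus a bonus when Z = 1."""
    direct = 1.0 if action == "honest" else 0.8
    return direct + 1.0 * z


def expected_return(action, z_source_action):
    """Expected reward of `action` when Z is distributed as if
    `z_source_action` had been taken."""
    p = P_Z1[z_source_action]
    return (1 - p) * reward(action, 0) + p * reward(action, 1)


def naive_value(action):
    # Ordinary expected return: Z responds to the action actually taken.
    return expected_return(action, action)


def path_specific_value(action):
    # Effect not mediated by Z: Z is drawn as under the baseline action,
    # so influencing Z cannot increase this objective.
    return expected_return(action, BASELINE_ACTION)


naive_choice = max(ACTIONS, key=naive_value)
safe_choice = max(ACTIONS, key=path_specific_value)
print(naive_choice, safe_choice)
```

In this toy setting the naive objective favours the action that shifts the delicate state, while the path-specific objective does not reward that shift. How the delicate state should be distributed once the agent's influence is removed is a modelling choice, and the baseline action used here is only one possible option.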