We present a general framework for training safe agents whose naive
incentives are unsafe. As an example, manipulative or deceptive behaviour can
improve rewards but should be avoided. Most approaches fail here: agents
maximize expected return by any means necessary. We formally describe settings
with 'delicate' parts of the state which should not be used as a means to an
end. Using Causal Influence Diagram analysis, we then train agents to maximize
the causal effect of actions on the expected return that is not mediated by the
delicate parts of the state. The resulting agents have no incentive to
control the delicate state. We further show how our framework unifies and
generalizes existing proposals.

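This paper proposes a general framework for training safe agents whose naive incentives are unsafe. As a case study, it discusses situations in which manipulative or deceptive behaviour can improve rewards but should be avoided. We formally describe settings with 'delicate' parts of the state, which should not be used as a means to an end. Using causal influence diagram analysis, we train agents to maximize the causal effect of actions on the expected return that is not mediated by the delicate state. We further show how the framework unifies and generalizes existing proposals.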
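
The difference between a naive objective and a path-specific one can be illustrated with a toy example. The sketch below is not the authors' implementation: the two actions, the reward numbers, and the choice to evaluate the delicate state under a fixed baseline action are all assumptions made for illustration. It only shows that when the effect routed through the delicate state is removed from the objective, the action that exploits that state stops looking attractive.

```python
# Toy illustration (hypothetical numbers, not taken from the paper).
ACTIONS = ["honest", "manipulate"]


def delicate_state(action):
    """Delicate part of the state, e.g. a user's opinion (assumed model)."""
    return 1.0 if action == "manipulate" else 0.0


def reward(action, delicate):
    """Return depends on the action directly and on the delicate state."""
    direct = {"honest": 1.0, "manipulate": 0.8}[action]
    return direct + 0.5 * delicate


def naive_objective(action):
    # The delicate state responds to the action being evaluated,
    # so influence exerted through it counts towards the return.
    return reward(action, delicate_state(action))


def path_specific_objective(action, baseline="honest"):
    # The delicate state is held at the value it takes under a baseline
    # action, so only the effect NOT mediated by it is counted.
    return reward(action, delicate_state(baseline))


naive_choice = max(ACTIONS, key=naive_objective)
safe_choice = max(ACTIONS, key=path_specific_objective)
print(naive_choice, safe_choice)
```

Under these assumed numbers the naive objective prefers the manipulative action, while the path-specific objective prefers the honest one, because the bonus earned by changing the delicate state no longer enters the evaluation.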

Path-Specific Objectives for Safer Agent Incentives

Which variables does an agent have an incentive to control with its decision,
and which variables does it have an incentive to respond to? We formalise these
incentives, and demonstrate unique graphical criteria for detecting them in any
single-decision causal influence diagram. To this end, we introduce structural
causal influence models, a hybrid of the influence diagram and structural
causal model frameworks. Finally, we illustrate how these incentives predict
agent incentives in both fairness and AI safety applications.

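We formalise the incentives governing which variables an agent controls with its decision and which it responds to, and demonstrate unique graphical criteria for detecting these incentives in any single-decision causal influence diagram. We introduce structural causal influence models, a hybrid of the influence diagram and structural causal model frameworks. Finally, we illustrate how these incentives predict agent incentives in fairness and AI safety applications.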
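
To make the idea of a graphical criterion concrete, here is a minimal sketch of one such check. It assumes that the criterion for an incentive to control a variable X is the existence of a directed path from the decision through X to a utility node; the abstract does not spell the criterion out, so treat this as an assumption. The graph, node names, and the helper `has_control_path` are hypothetical.

```python
from collections import deque


def descendants(graph, start):
    """Return all nodes reachable from `start` via directed edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def has_control_path(graph, decision, variable, utilities):
    """Check for a directed path decision -> ... -> variable -> ... -> utility."""
    if variable not in descendants(graph, decision):
        return False
    below = descendants(graph, variable)
    return variable in utilities or any(u in below for u in utilities)


# Hypothetical single-decision diagram for a content recommender:
# D is the decision (posts shown) and U is the utility (clicks counted).
graph = {
    "OriginalOpinion": ["Model", "Opinion"],
    "Model": ["D"],
    "D": ["Opinion", "Clicks"],
    "Opinion": ["Clicks"],
    "Clicks": ["U"],
}

for x in ["Opinion", "Clicks", "OriginalOpinion", "Model"]:
    print(x, has_control_path(graph, "D", x, {"U"}))
```

In this toy diagram the check flags `Opinion` and `Clicks`, which lie on a path from the decision to the utility, and does not flag `OriginalOpinion` or `Model`, which are upstream of the decision.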