Path-Specific Objectives to Ensure Safe Agent Incentives

April 2022

Path-Specific Objectives for Safer Agent Incentives

Sebastian Farquhar, Ryan Carey, Tom Everitt

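TL;DR: This paper presents a general framework for training safe agents whose naive incentives are unsafe. As a motivating case, it discusses situations in which manipulative or deceptive behaviour can improve return but should be avoided. The paper formally describes “delicate” parts of the state, which should not be used as a means to an end. Using causal influence diagram analysis, agents are trained to maximize the causal effect of actions on expected return that is not mediated by the delicate state. The paper further shows how this framework unifies and generalizes existing proposals.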
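To make the idea of "the effect not mediated by the delicate state" concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the two actions, the numeric values, the function names, and the choice of evaluating the delicate state under a fixed baseline action are all assumptions made for illustration. The naive objective lets the action change the delicate state (here, a user's opinion) and collects reward through it. The path-specific objective holds the delicate state at the value it would take under the baseline action, so only the direct path from action to reward counts.

```python
# Toy illustration only: all names and numbers here are hypothetical.

ACTIONS = ["honest", "manipulate"]


def delicate_state(action):
    """User opinion after the action (the part of state that must not be a means to an end)."""
    return 10 if action == "manipulate" else 2


def reward(action, opinion):
    """Reward depends directly on the action and also on the user's opinion."""
    task_value = {"honest": 6, "manipulate": 3}[action]
    return task_value + opinion


def naive_objective(action):
    # The action is allowed to influence reward through the delicate state.
    return reward(action, delicate_state(action))


def path_specific_objective(action, baseline="honest"):
    # The delicate state is evaluated as if the baseline action had been taken,
    # which removes the action -> opinion -> reward path from the objective.
    return reward(action, delicate_state(baseline))


naive_choice = max(ACTIONS, key=naive_objective)
safe_choice = max(ACTIONS, key=path_specific_objective)

print("naive agent picks:", naive_choice)
print("path-specific agent picks:", safe_choice)
```

In this toy setting the naive agent prefers manipulation because shifting the opinion pays off, while the path-specific agent gains nothing from shifting it and so prefers the honest action. How the baseline for the delicate state should be chosen in practice is a design decision that this sketch does not settle.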

Abstract

We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most approaches fail here: