As more machine learning agents interact with humans, it becomes increasingly likely that an agent trained to perform a task optimally, using only a measure of task performance as feedback, will violate societal norms for acceptable behavior or cause harm. Value alignment is a property of intelligent agents in which they pursue only non-harmful behaviors or human-beneficial goals. We introduce an approach to value-aligned reinforcement learning in which we train an agent with two reward signals: a standard task performance reward and a normative behavior reward. The normative behavior reward is derived from a value-aligned prior model previously shown to classify text as normative or non-normative. We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as more normative. We test our value-alignment technique on three interactive text-based worlds; each world is designed specifically to challenge agents with a task and to provide opportunities to deviate from the task in order to engage in normative and/or altruistic behavior.
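The abstract does not say which policy shaping variants are used, so the following is only a minimal sketch, assuming the common multiplicative form of policy shaping: the agent's task-driven action distribution is multiplied by a per-action probability that the action is normative, and the result is renormalized. The names `softmax`, `shape_policy`, and `norm_weight`, and the use of a softmax over task values, are illustrative assumptions and are not taken from the paper.

```python
import math


def softmax(values, temperature=1.0):
    """Turn a list of action values into a probability distribution."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def shape_policy(task_values, norm_probs, norm_weight=1.0):
    """Combine a task policy with a normative prior (hypothetical helper).

    task_values: per-action value estimates learned from the task reward.
    norm_probs:  per-action probability, from a normative classifier,
                 that the action is normative (assumed input).
    norm_weight: exponent controlling how strongly the prior shapes the
                 policy; 0 ignores the prior entirely (assumed knob).
    """
    task_probs = softmax(task_values)
    combined = [p * (n ** norm_weight) for p, n in zip(task_probs, norm_probs)]
    total = sum(combined)
    if total == 0.0:
        # The prior ruled out every action; fall back to the task policy.
        return task_probs
    return [c / total for c in combined]


if __name__ == "__main__":
    # Two equally valuable actions; the prior strongly prefers the first.
    print(shape_policy([1.0, 1.0], [0.9, 0.1]))
```

Under this assumption, raising `norm_weight` shifts probability mass toward actions the prior considers normative, while setting it to 0 recovers the purely task-driven policy, which is one simple way to trade off the two signals.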

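We introduce an approach to value-aligned reinforcement learning that trains an agent with two reward signals: a standard task performance reward and a normative behavior reward derived from a value-aligned prior model. We show how a policy shaping technique can balance these two signals to produce policies that are both effective and more normative, and we test the approach in three interactive text-based worlds.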

Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior