Reward learning from human preferences and demonstrations in Atari

Nov, 2018

Reward learning from human preferences and demonstrations in Atari

Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg...

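TL;DR: This work trains a deep neural network to model a reward function from human feedback and uses the predicted reward to train a reinforcement learning agent. It combines two forms of feedback: expert demonstrations and trajectory preferences. In experiments on 9 Atari games, the approach outperforms the imitation learning baseline and reaches superhuman performance in 2 of the games. The authors also study how well the reward model fits, the problem of reward hacking, and the effect of noise in the human labels.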
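To make the idea of learning a reward from trajectory preferences more concrete, here is a minimal sketch of a pairwise comparison loss. It assumes a Bradley-Terry style model, in which the probability that a human prefers one segment grows with the difference in summed predicted reward; the summary above does not spell out the loss, so treat this as an illustration rather than the authors' implementation. The names `segment_return`, `preference_loss`, and `reward_model`, and the 0.5 label for ties, are hypothetical.

```python
import numpy as np


def segment_return(reward_model, segment):
    """Sum the predicted per-step rewards over one trajectory segment."""
    return float(sum(reward_model(obs) for obs in segment))


def preference_loss(reward_model, seg_a, seg_b, label):
    """Cross-entropy loss for a single human comparison (assumed form).

    label is 1.0 if the annotator preferred seg_a, 0.0 if seg_b,
    and 0.5 for a tie (a hypothetical convention for this sketch).
    """
    ra = segment_return(reward_model, seg_a)
    rb = segment_return(reward_model, seg_b)
    # P(a preferred) = exp(ra) / (exp(ra) + exp(rb)) = sigmoid(ra - rb).
    # logaddexp keeps the log-probabilities numerically stable.
    log_p_a = -np.logaddexp(0.0, rb - ra)
    log_p_b = -np.logaddexp(0.0, ra - rb)
    return float(-(label * log_p_a + (1.0 - label) * log_p_b))


if __name__ == "__main__":
    # Toy reward model: the observation itself is the reward.
    toy_reward = lambda obs: obs
    good, bad = [1.0, 1.0], [0.0, 0.0]
    print(preference_loss(toy_reward, good, bad, 1.0))  # label agrees with model
    print(preference_loss(toy_reward, good, bad, 0.0))  # label disagrees
```

In this sketch the loss is small when the human label agrees with the ordering implied by the reward model and large when it disagrees, which is the signal that would push the reward model toward the annotator's preferences during training.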

Abstract

To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we can have humans communicate an objective to the agent directly. In this work, we combine two approaches to learning from →