In many sequential decision-making tasks, it is challenging to design reward
functions that help an RL agent efficiently learn behavior that the agent
designer considers good. A number of different formulations of the
reward-design problem, or close variants thereof, have been proposed in the
literature. In this paper we build on the Optimal Rewards Framework of Singh
et al., which defines the optimal intrinsic reward function as one that, when
used by an RL agent, achieves behavior that optimizes the task-specifying, or
extrinsic, reward function. Previous work in this framework has shown how good
intrinsic reward functions can be learned for lookahead-search-based planning
agents. Whether it is possible to learn intrinsic reward functions for learning
agents remains an open problem. In this paper we derive a novel algorithm for
learning intrinsic rewards for policy-gradient-based learning agents. We
compare the performance of an augmented agent, which uses our algorithm to
provide additive intrinsic rewards to an A2C-based policy learner (for Atari
games) and a PPO-based policy learner (for MuJoCo domains), with that of a
baseline agent that uses the same policy learners but with only extrinsic
rewards. Our results show improved performance on most but not all of the
domains.

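In summary, this paper studies why optimizing the reward function matters for
reinforcement learning performance in sequential decision-making tasks, proposes
an algorithm for learning intrinsic rewards for policy-gradient-based learning
agents, and compares the performance of RL agents that use this method with
agents that use only extrinsic rewards.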
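
To make "additive intrinsic rewards" concrete, the following is a minimal
sketch, not the algorithm derived in the paper. It assumes the intrinsic
rewards for one episode are already given as a list, and it only shows how they
could be added to the extrinsic rewards before computing the discounted returns
that a policy-gradient learner would use. The function name mixed_returns and
the weight lam are hypothetical and do not come from the paper.

```python
def mixed_returns(extrinsic, intrinsic, gamma=0.99, lam=0.1):
    """Discounted returns of r_ex + lam * r_in over one episode.

    extrinsic, intrinsic: per-step rewards of equal length.
    gamma: discount factor.
    lam: hypothetical weight on the intrinsic reward (an assumption of this
         sketch); lam = 0 recovers the extrinsic-only baseline.
    """
    if len(extrinsic) != len(intrinsic):
        raise ValueError("reward sequences must have the same length")

    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for r_ex, r_in in zip(reversed(extrinsic), reversed(intrinsic)):
        running = (r_ex + lam * r_in) + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    r_ex = [0.0, 0.0, 1.0]   # sparse task reward
    r_in = [1.0, 1.0, 1.0]   # placeholder intrinsic reward
    print(mixed_returns(r_ex, r_in, gamma=0.5, lam=0.0))  # baseline agent
    print(mixed_returns(r_ex, r_in, gamma=0.5, lam=1.0))  # augmented agent
```

In this sketch the policy update would see the mixed returns, while the
extrinsic return alone remains the measure by which the agent is evaluated.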