We present a method for learning intrinsic reward functions to drive the
learning of an agent during periods of practice in which extrinsic task rewards
are not available. During practice, the environment may differ from the one
available for training and evaluation with extrinsic rewards. We refer to this
setup of alternating periods of practice and objective evaluation as
practice-match, drawing an analogy to regimes of skill acquisition common for
humans in sports and games. The agent must effectively use periods in the
practice environment so that performance improves during matches. In the
proposed method, the intrinsic practice reward is learned through a
meta-gradient approach that adapts the practice reward parameters to reduce the
extrinsic reward loss computed from matches. We illustrate the method on a
simple grid world, and evaluate it in two games in which the practice
environment differs from the match environment: Pong with practice against a
wall without an opponent, and PacMan with practice in a maze without ghosts.
The results show that learning during practice in addition to match periods
yields gains over learning in matches only.

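This study proposes a method that learns intrinsic reward functions to drive an
agent's learning during practice periods, when extrinsic task rewards are
unavailable, and adapts the practice reward parameters with a meta-gradient
approach. The method is evaluated in a grid world and in two games, and shows
the benefit of learning in both practice and matches.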

How Should an Agent Practice?

In many sequential decision-making tasks, it is challenging to design reward
functions that help an RL agent efficiently learn behavior that is considered
good by the agent designer. A number of different formulations of the
reward-design problem, or close variants thereof, have been proposed in the
literature. In this paper we build on the Optimal Rewards Framework of Singh
et al., which defines the optimal intrinsic reward function as one that, when
used by an RL agent, achieves behavior that optimizes the task-specifying or
extrinsic reward function. Previous work in this framework has shown how good
intrinsic reward functions can be learned for lookahead-search-based planning
agents. Whether it is possible to learn intrinsic reward functions for learning
agents remains an open problem. In this paper we derive a novel algorithm for
learning intrinsic rewards for policy-gradient-based learning agents. We
compare the performance of an augmented agent that uses our algorithm to
provide additive intrinsic rewards to an A2C-based policy learner (for Atari
games) and a PPO-based policy learner (for MuJoCo domains) with a baseline
agent that uses the same policy learners but with only extrinsic rewards. Our
results show improved performance on most but not all of the domains.
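
As a rough illustration of what "additive intrinsic rewards" means for a
policy-gradient learner, the sketch below computes the discounted returns that
such a learner would use as targets, once from the extrinsic reward alone and
once from the sum of extrinsic and intrinsic rewards. It is a sketch under
stated assumptions, not the paper's algorithm: the weighting coefficient lam,
the discount gamma, and the function names are hypothetical, and the intrinsic
rewards are given as plain numbers rather than produced by a learned function.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def mixed_returns(extrinsic, intrinsic, lam=0.5, gamma=0.99):
    """Discounted returns of the additive reward r_ex + lam * r_in.

    lam is a hypothetical mixing weight chosen for this illustration.
    """
    if len(extrinsic) != len(intrinsic):
        raise ValueError("reward sequences must have the same length")
    mixed = [r_ex + lam * r_in for r_ex, r_in in zip(extrinsic, intrinsic)]
    return discounted_returns(mixed, gamma)


if __name__ == "__main__":
    r_ex = [0.0, 0.0, 1.0]  # sparse extrinsic reward at the end of the episode
    r_in = [1.0, 0.0, 0.0]  # illustrative intrinsic reward on the first step
    print(discounted_returns(r_ex, gamma=0.5))          # baseline agent targets
    print(mixed_returns(r_ex, r_in, lam=0.5, gamma=0.5))  # augmented agent targets
```

The baseline agent in the comparison corresponds to the first call, where only
the extrinsic reward contributes to the return; the augmented agent sees a
denser signal through the added intrinsic term.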

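This paper examines the importance of reward-function design for reinforcement
learning performance in sequential decision-making tasks, proposes an algorithm
for learning intrinsic rewards for policy-gradient-based learning agents, and
compares the performance of agents that use the method with agents that use
only extrinsic rewards.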