Reward design is a critical part of the application of reinforcement
learning, the performance of which strongly depends on how well the reward
signal frames the goal of the designer and how well the signal assesses
progress in reaching that goal. In many cases, the extrinsic rewards provided
by the environment (e.g., win or loss of a game) are very sparse and make it
difficult to train agents directly. Researchers usually assist the learning of
agents by adding auxiliary rewards in practice. However, designing
auxiliary rewards often turns into a trial-and-error search for reward
settings that produce acceptable results. In this paper, we propose to
automatically generate goal-consistent intrinsic rewards for the agent to
learn from, such that maximizing them also maximizes the expected cumulative
extrinsic reward. To this end, we introduce the concept of motivation, which
captures the underlying goal of maximizing certain rewards, and propose the
motivation-based reward design method. The basic idea is to shape the intrinsic
rewards by minimizing the distance between the intrinsic and extrinsic motivations. We
conduct extensive experiments and show that our method performs better than the
state-of-the-art methods in handling problems of delayed reward, exploration,
and credit assignment.

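This paper proposes a motivation-based reward design method that automatically generates goal-consistent intrinsic rewards so as to maximize the expected cumulative extrinsic reward. The method outperforms existing methods in handling delayed reward, exploration, and credit assignment.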

Automatic Reward Design via Learning Motivation-Consistent Intrinsic Rewards

This paper explores a simple regularizer for reinforcement learning by
proposing Generative Adversarial Self-Imitation Learning (GASIL), which
encourages the agent to imitate past good trajectories via the generative
adversarial imitation learning framework. Instead of directly maximizing
rewards, GASIL focuses on reproducing past good trajectories, which can
potentially make long-term credit assignment easier when rewards are sparse and
delayed. GASIL can be easily combined with any policy gradient objective by
using GASIL as a learned shaped reward function. Our experimental results show
that GASIL improves the performance of proximal policy optimization on 2D Point
Mass and MuJoCo environments with delayed reward and stochastic dynamics.
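
The self-imitation idea described above can be illustrated with a small sketch. It is not the authors' implementation: the class `TopKTrajectoryBuffer`, the function `shaped_reward`, the mixing weight `alpha`, and the particular bonus `-log(1 - D)` are all assumptions chosen for illustration, and the discriminator itself is left out (its output is passed in as a plain probability).

```python
import heapq
import math
from itertools import count


class TopKTrajectoryBuffer:
    """Hypothetical buffer that keeps the K highest-return trajectories seen so far."""

    def __init__(self, k):
        self.k = k
        self._heap = []      # min-heap of (return, tie_breaker, trajectory)
        self._tie = count()  # unique counter so trajectories are never compared

    def add(self, trajectory, episode_return):
        item = (episode_return, next(self._tie), trajectory)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif episode_return > self._heap[0][0]:
            # Evict the current worst trajectory in favour of the better one.
            heapq.heapreplace(self._heap, item)

    def trajectories(self):
        """Return stored trajectories, best return first."""
        return [traj for _, _, traj in sorted(self._heap, reverse=True)]


def shaped_reward(env_reward, d_prob, alpha=0.5, eps=1e-8):
    """Blend the environment reward with an imitation bonus (assumed form).

    d_prob is an assumed discriminator output: the estimated probability
    that a state-action pair came from the good-trajectory buffer.
    """
    imitation_bonus = -math.log(1.0 - d_prob + eps)
    return alpha * env_reward + imitation_bonus


buffer = TopKTrajectoryBuffer(k=2)
buffer.add("traj_a", episode_return=1.0)
buffer.add("traj_b", episode_return=5.0)
buffer.add("traj_c", episode_return=3.0)  # evicts traj_a
best = buffer.trajectories()
```

In this sketch the shaped reward would be fed to any policy gradient learner in place of the raw environment reward, which mirrors the abstract's description of using GASIL as a learned shaped reward function.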

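This paper proposes a simple regularizer, Generative Adversarial Self-Imitation Learning (GASIL), which encourages the agent to imitate past good trajectories via the generative adversarial imitation learning framework instead of directly maximizing rewards, making long-term credit assignment easier when rewards are sparse and delayed. GASIL can be easily combined with any policy gradient objective by using it as a learned shaped reward function. Experimental results show that GASIL improves the performance of proximal policy optimization on 2D Point Mass and MuJoCo environments.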

Generative Adversarial Self-Imitation Learning

As a step towards developing zero-shot task generalization capabilities in
reinforcement learning (RL), we introduce a new RL problem where the agent
should learn to execute sequences of instructions after learning useful skills
that solve subtasks. In this problem, we consider two types of generalization:
to previously unseen instructions and to longer sequences of instructions. For
generalization over unseen instructions, we propose a new objective which
encourages learning correspondences between similar subtasks by making
analogies. For generalization over sequential instructions, we present a
hierarchical architecture where a meta controller learns to use the acquired
skills for executing the instructions. To deal with delayed reward, we propose
a new neural architecture in the meta controller that learns when to update the
subtask, which makes learning more efficient. Experimental results on a
stochastic 3D domain show that the proposed ideas are crucial for
generalization to longer instructions as well as unseen instructions.

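We introduce a new reinforcement learning problem in which the agent must learn to execute sequences of instructions after learning useful skills that solve subtasks. We consider generalization to previously unseen instructions and to longer instruction sequences. To this end, we propose a new analogy-based objective and a hierarchical architecture, along with a new neural architecture for handling delayed reward. Experimental results show that these proposals are crucial for generalizing to longer instruction sequences as well as to unseen instructions.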

Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning

Reinforcement learning optimizes policies for expected cumulative reward.
Need the supervision be so narrow? Reward is delayed and sparse for many tasks,
making it a difficult and impoverished signal for end-to-end optimization. To
augment reward, we consider a range of self-supervised tasks that incorporate
states, actions, and successors to provide auxiliary losses. These losses offer
ubiquitous and instantaneous supervision for representation learning even in
the absence of reward. While current results show that learning from reward
alone is feasible, pure reinforcement learning methods are constrained by
computational and data efficiency issues that can be remedied by auxiliary
losses. Self-supervised pre-training and joint optimization improve the data
efficiency and policy returns of end-to-end reinforcement learning.
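
As a rough illustration of joint optimization with an auxiliary loss, the sketch below adds a self-supervised term to a policy loss. The abstract does not name specific tasks or weights, so the inverse-dynamics task (predicting the action taken between a state and its successor), the function names, and the `aux_weight` value are assumptions made for this example only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_dynamics_loss(action_logits, taken_action):
    """Cross-entropy for predicting which action led from s to s'.

    action_logits is assumed to come from a model that sees (s, s');
    that model is not part of this sketch.
    """
    probs = softmax(action_logits)
    return -math.log(probs[taken_action])


def joint_loss(policy_loss, aux_losses, aux_weight=0.1):
    """Policy loss plus a weighted mean of auxiliary losses (assumed form)."""
    if not aux_losses:
        return policy_loss
    return policy_loss + aux_weight * sum(aux_losses) / len(aux_losses)


# The auxiliary term is defined on every transition, even when reward is zero.
aux = [
    inverse_dynamics_loss([2.0, 0.5, -1.0], taken_action=0),
    inverse_dynamics_loss([0.0, 0.0, 0.0], taken_action=2),
]
total = joint_loss(policy_loss=1.0, aux_losses=aux, aux_weight=0.1)
```

Because the auxiliary term depends only on states, actions, and successors, it supplies a gradient on every transition, which is the sense in which the abstract calls such supervision instantaneous.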

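This paper explores how auxiliary losses, introduced through self-supervised pre-training and joint optimization, can improve data efficiency and policy returns in reinforcement learning.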