Many applications of reinforcement learning can be formalized as
goal-conditioned environments, where, in each episode, there is a "goal" that
affects the rewards obtained during that episode but does not affect the
dynamics. Various techniques have been proposed to improve performance in
goal-conditioned environments, such as automatic curriculum generation and goal
relabeling. In this work, we explore a connection between off-policy
reinforcement learning in goal-conditioned settings and knowledge distillation.
In particular, the current Q-value function and the target Q-value estimate are
both functions of the goal, and we would like to train the Q-value function to
match its target for all goals. We therefore apply Gradient-Based Attention
Transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to
the Q-function update. We empirically show that this can improve the
performance of goal-conditioned off-policy reinforcement learning when the
space of goals is high-dimensional. We also show that this technique can be
adapted to allow for efficient learning in the case of multiple simultaneous
sparse goals, where the agent can attain a reward by achieving any one of a
large set of objectives, all specified at test time. Finally, to provide
theoretical support, we give examples of classes of environments where (under
some assumptions) standard off-policy algorithms such as DDPG require at least
O(d^2) replay buffer transitions to learn an optimal policy, while our proposed
technique requires only O(d) transitions, where d is the dimensionality of the
goal and state space. Code is available at
this https URL

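Summary: This paper studies reinforcement learning in goal-conditioned
environments and proposes a knowledge-distillation-based update for the Q-value
function. The update can significantly improve goal-conditioned policy learning
when the goal space is high-dimensional, and it can also be applied effectively
to learning with multiple goals. The paper further provides theoretical support,
showing that the proposed method needs only O(d) transitions to accomplish the
task, whereas the standard off-policy algorithm DDPG needs at least O(d^2)
transitions to learn an optimal policy.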
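
The gradient-matching idea described in the abstract can be illustrated with a
minimal sketch. It assumes a fixed state-action pair, so that both the Q-value
and its target reduce to scalar functions of the goal, and it adds to the usual
squared error a penalty on the mismatch between their gradients with respect to
the goal. The names `attention_transfer_loss`, `numerical_gradient`, and `beta`,
the linear toy functions, and the finite-difference gradients are all
assumptions made for illustration. This is not the paper's implementation, which
would presumably rely on automatic differentiation through a neural Q-network.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
GoalFunction = Callable[[Vector], float]


def numerical_gradient(f: GoalFunction, goal: Vector, eps: float = 1e-5) -> List[float]:
    """Central-difference estimate of df/dgoal, one goal coordinate at a time."""
    grad = []
    for i in range(len(goal)):
        plus = list(goal)
        minus = list(goal)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad


def attention_transfer_loss(
    q_of_goal: GoalFunction,
    target_of_goal: GoalFunction,
    goal: Vector,
    beta: float = 1.0,
) -> float:
    """Squared value error plus beta times the squared goal-gradient mismatch.

    Hypothetical helper: `beta` is an assumed weighting hyperparameter.
    """
    value_error = q_of_goal(goal) - target_of_goal(goal)
    grad_q = numerical_gradient(q_of_goal, goal)
    grad_t = numerical_gradient(target_of_goal, goal)
    grad_error = sum((a - b) ** 2 for a, b in zip(grad_q, grad_t))
    return value_error ** 2 + beta * grad_error


# Toy example (assumed, not from the paper): Q and its target are both linear
# in a 3-dimensional goal and disagree only in the second coordinate.
q_weights = [1.0, 2.0, 3.0]
target_weights = [1.0, 0.0, 3.0]


def toy_q(goal: Vector) -> float:
    return sum(w * g for w, g in zip(q_weights, goal))


def toy_target(goal: Vector) -> float:
    return sum(w * g for w, g in zip(target_weights, goal))


goal = [0.0, 0.0, 0.0]
loss_value_only = attention_transfer_loss(toy_q, toy_target, goal, beta=0.0)
loss_with_grad = attention_transfer_loss(toy_q, toy_target, goal, beta=1.0)
print("value-only loss:", loss_value_only)
print("loss with gradient term:", loss_with_grad)
```

At the goal g = 0 the scalar error vanishes in this toy case, so a value-only
loss sees nothing to correct, while the gradient term still exposes the mismatch
in each goal coordinate. This gives some intuition for why a single transition
may carry on the order of d constraints rather than one, though the formal O(d)
versus O(d^2) statement depends on the assumptions made in the paper.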