While using shaped rewards can be beneficial when solving sparse reward
tasks, their successful application often requires careful engineering and is
problem specific. For instance, in tasks where the agent must achieve some goal
state, simple distance-to-goal reward shaping often fails, as it renders
learning vulnerable to local optima. We introduce a simple and effective
model-free method to learn from shaped distance-to-goal rewards on tasks where
success depends on reaching a goal state. Our method introduces an auxiliary
distance-based reward, computed from pairs of rollouts, to encourage diverse
exploration. This approach effectively prevents learning dynamics from
stabilizing around local optima induced by the naive distance-to-goal reward
shaping and enables policies to efficiently solve sparse reward tasks. Our
augmented objective does not require any additional reward engineering or
domain expertise to implement and converges to the original sparse objective as
the agent learns to solve the task. We demonstrate that our method successfully
solves a variety of hard-exploration tasks (including maze navigation and 3D
construction in a Minecraft environment), where naive distance-based reward
shaping otherwise fails, and intrinsic curiosity and reward relabeling
strategies exhibit poor performance.
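
To make the idea of a paired-rollout reward concrete, here is a minimal sketch. The abstract gives no formula, so everything here is an assumption for illustration: the function names, the Euclidean distance, the success threshold `eps`, and the particular way the two terms are combined. Each rollout is scored by its negative distance to the goal plus its distance to the final state of a second ("sibling") rollout, capped at zero so that only reaching the goal yields a positive reward.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(a, b)


def naive_shaped_reward(final_state, goal):
    """Naive distance-to-goal shaping: closer to the goal is always better."""
    return -euclidean(final_state, goal)


def paired_rollout_reward(final_state, sibling_final_state, goal, eps=0.1):
    """Hypothetical paired-rollout reward (an assumed form, not the paper's).

    - Within `eps` of the goal: return the sparse success reward 1.0.
    - Otherwise: penalize distance to the goal, but offset the penalty by the
      distance to the sibling rollout's final state, so that two rollouts
      ending in the same spot gain nothing from staying together.
    """
    d_goal = euclidean(final_state, goal)
    if d_goal <= eps:
        return 1.0
    d_sibling = euclidean(final_state, sibling_final_state)
    return min(0.0, -d_goal + d_sibling)


if __name__ == "__main__":
    goal = (10.0, 0.0)
    stuck = (5.0, 0.0)  # imagined local optimum of the naive shaping
    print(naive_shaped_reward(stuck, goal))
    print(paired_rollout_reward(stuck, stuck, goal))
    print(paired_rollout_reward(stuck, (2.0, 0.0), goal))
    print(paired_rollout_reward(goal, stuck, goal))
```

Under these assumptions, two rollouts that end at the same non-goal state receive the full distance penalty, while a rollout that ends away from its sibling is penalized less. Once rollouts land within `eps` of the goal, the auxiliary term no longer applies and only the sparse success signal remains, which matches the abstract's claim that the augmented objective reduces to the sparse one as the task is solved.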

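This work introduces a simple and effective model-free method, based on an auxiliary distance-based reward, that lets learning agents efficiently solve sparse reward tasks that are hard to solve with naive distance rewards, without requiring additional reward engineering or domain expertise.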

Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards

Goal-conditioned policies are used to break down complex
reinforcement learning (RL) problems by using subgoals, which can be defined
either in state space or in a latent feature space. This can increase the
efficiency of learning by using a curriculum, and also enables simultaneous
learning and generalization across goals. A crucial requirement of
goal-conditioned policies is to be able to determine whether the goal has been
achieved. Having a notion of distance to a goal is thus a key component of
this approach. However, it is not straightforward to come up with an
appropriate distance, and in some tasks, the goal space may not even be known a
priori. In this work, we learn a distance-to-goal estimate in a self-supervised
manner, computed as the number of actions that would need to be carried out to
reach the goal. Our method solves complex tasks without prior domain knowledge
in the online setting in three different scenarios for goal-conditioned
policies: (a) the goal space is the same as the state space; (b) the goal space
is given but an appropriate distance is unknown; and (c) the state space is
accessible, but only a subset of it represents desired goals, and this subset
is known a priori. We also propose a goal-generation
mechanism as a secondary contribution.
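
As an illustration of how action counts can supply self-supervised distance labels, the sketch below pairs up states from the same trajectory and uses the number of steps between them as the target. This is an assumed, simplified setup and not the paper's procedure: the function names are hypothetical, states are hashable labels, and a lookup table that keeps the smallest observed step count stands in for a learned distance model.

```python
from itertools import combinations


def action_count_pairs(trajectory):
    """Return (state_i, state_j, steps) for every i < j in one trajectory.

    The step count j - i is an upper bound on the true number of actions
    needed, since the trajectory may not follow a shortest path.
    """
    return [
        (trajectory[i], trajectory[j], j - i)
        for i, j in combinations(range(len(trajectory)), 2)
    ]


def tabular_distance(trajectories):
    """Keep the smallest observed step count for each (state, goal) pair.

    A table is used only for illustration; a function approximator trained
    on the same labels would play this role in larger state spaces.
    """
    best = {}
    for trajectory in trajectories:
        for state, goal, steps in action_count_pairs(trajectory):
            key = (state, goal)
            if steps < best.get(key, float("inf")):
                best[key] = steps
    return best


def goal_reached(distances, state, goal, threshold=0):
    """Treat the goal as achieved if the estimated distance is small enough."""
    if state == goal:
        return True
    return distances.get((state, goal), float("inf")) <= threshold


if __name__ == "__main__":
    trajectories = [["A", "B", "C", "D"], ["A", "C"]]
    distances = tabular_distance(trajectories)
    print(distances[("A", "C")])
    print(goal_reached(distances, "A", "D", threshold=1))
```

No goal-space metric is supplied here: the labels come entirely from the agent's own experience, which is the sense in which such an estimate is self-supervised. The `threshold` argument is an assumed knob for deciding when a goal counts as achieved.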

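For reinforcement learning problems decomposed into subgoals, this paper proposes a method for learning an appropriate distance to determine whether a goal has been achieved, offers solutions for three different scenarios, and also proposes a goal-generation mechanism.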