To perform robot manipulation tasks, a low-dimensional state of the
environment typically needs to be estimated. However, designing a state
estimator can sometimes be difficult, especially in environments with
deformable objects. An alternative is to learn an end-to-end policy that maps
directly from high-dimensional sensor inputs to actions. If this policy is
trained with reinforcement learning, however, the absence of a state estimator
makes it hard to specify a reward function from high-dimensional observations.
To meet this challenge, we propose a simple indicator reward function for
goal-conditioned reinforcement learning: we only give a positive reward when
the robot's observation exactly matches a target goal observation. We show that
by relabeling the original goal with the achieved goal to obtain positive
rewards (Andrychowicz et al., 2017), we can learn with the indicator reward
function even in continuous state spaces. We propose two methods to further
speed up convergence with indicator rewards: reward balancing and reward
filtering. Our method performs comparably to an oracle that uses the
ground-truth state to compute rewards, and it can perform complex tasks in
continuous state spaces, such as rope manipulation from RGB-D images, without
knowledge of the ground-truth state.

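Summary: A simple indicator reward function is proposed to address the challenge that, when a policy is trained with reinforcement learning in continuous state spaces, a reward function cannot readily be specified from high-dimensional observations. Two methods, reward balancing and reward filtering, are proposed to further speed up convergence when learning with the indicator reward. On complex tasks such as rope manipulation from RGB-D images, performed without knowledge of the ground-truth state, the method is shown to perform comparably to an oracle that uses the ground-truth state.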
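
The indicator reward and the goal relabeling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tuple layout of a transition, and the choice to relabel with the final achieved observation of a trajectory are assumptions made for illustration, and small constant arrays stand in for RGB-D images. Reward balancing and reward filtering are not shown.

```python
import numpy as np


def indicator_reward(observation, goal):
    """Return 1.0 only when the observation exactly matches the goal, else 0.0."""
    return 1.0 if np.array_equal(observation, goal) else 0.0


def relabel_with_achieved_goal(trajectory):
    """Replace the original goal with the observation actually reached.

    `trajectory` is a list of (obs, action, next_obs, goal) tuples; this
    layout is hypothetical. The last next_obs is used as the new goal
    (an assumed relabeling choice), so the final transition is guaranteed
    to receive a positive indicator reward.
    """
    achieved = trajectory[-1][2]
    return [
        (obs, action, next_obs, achieved, indicator_reward(next_obs, achieved))
        for obs, action, next_obs, _ in trajectory
    ]


# Toy stand-ins for image observations: each frame is a distinct constant array.
shape = (4, 4, 4)
frames = [np.full(shape, float(t)) for t in range(4)]
original_goal = np.full(shape, -1.0)  # a goal the trajectory never reaches

trajectory = [(frames[t], t, frames[t + 1], original_goal) for t in range(3)]

# With the original goal, an exact match never occurs, so every reward is zero.
original_rewards = [indicator_reward(n, g) for _, _, n, g in trajectory]

# After relabeling, the last transition matches its goal exactly.
relabeled_rewards = [r for *_, r in relabel_with_achieved_goal(trajectory)]

print(original_rewards)   # [0.0, 0.0, 0.0]
print(relabeled_rewards)  # [0.0, 0.0, 1.0]
```

The toy run shows why relabeling matters in continuous spaces: an exact match with an arbitrary goal observation essentially never happens by chance, whereas an exact match with an observation the robot actually reached happens by construction.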