Hindsight goal relabeling has become a foundational technique for multi-goal reinforcement learning (RL). The idea is quite simple: any arbitrary trajectory can be seen as an expert demonstration for reaching the trajectory's end state. Intuitively, this procedure trains a goal-conditioned policy to imitate a sub-optimal expert. However, this connection between imitation and hindsight relabeling is not well understood. Modern imitation learning algorithms are described in the language of divergence minimization, and yet it remains an open problem how to recast hindsight goal relabeling into that framework. In this work, we develop a unified objective for goal-reaching that explains such a connection, from which we can derive goal-conditioned supervised learning (GCSL) and the reward function in hindsight experience replay (HER) from first principles. Experimentally, we find that despite recent advances in goal-conditioned behaviour cloning (BC), multi-goal Q-learning can still outperform BC-like methods; moreover, a vanilla combination of both actually hurts model performance. Under our framework, we study when BC is expected to help, and empirically validate our findings. Our work further bridges goal-reaching and generative modeling, illustrating the nuances and new pathways of extending the success of generative models to RL.

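This paper explains hindsight goal relabeling in multi-goal reinforcement learning from the perspective of divergence minimization. It recasts goal reaching as an imitation learning problem and derives several algorithms from that framework. Experiments show that, under hindsight relabeling, Q-learning outperforms behavior cloning, but a vanilla combination of the two degrades performance. The paper also explains the puzzling observation that a (-1, 0) reward performs markedly better than a (0, 1) reward.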
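To make the relabeling idea concrete, the following is a minimal sketch, not the paper's implementation. It treats a trajectory's final state as the goal it "demonstrates", producing (state, action, goal) tuples of the kind a GCSL-style supervised learner could consume, and transitions with a sparse reward of the kind a HER-style Q-learner could consume. The function names, the tuple-based state representation, and the `convention` parameter are hypothetical and introduced only for illustration; the exact-match goal test is also an assumption.

```python
from typing import List, Tuple

State = Tuple[int, int]   # hypothetical grid-world state, for illustration only
Action = str


def relabel_final(states: List[State], actions: List[Action]):
    """Relabel a trajectory with its own final state as the goal.

    Returns (state, action, goal) tuples, i.e. supervised targets for a
    goal-conditioned policy.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")
    goal = states[-1]
    return [(s, a, goal) for s, a in zip(states[:-1], actions)]


def sparse_reward(next_state: State, goal: State,
                  convention: str = "minus_one_zero") -> float:
    """Sparse goal-reaching reward under two conventions.

    "minus_one_zero": -1 until the goal is reached, 0 on reaching it.
    "zero_one":        0 until the goal is reached, 1 on reaching it.
    The goal test here is exact equality (an assumption of this sketch).
    """
    reached = next_state == goal
    if convention == "minus_one_zero":
        return 0.0 if reached else -1.0
    if convention == "zero_one":
        return 1.0 if reached else 0.0
    raise ValueError(f"unknown convention: {convention}")


def relabel_transitions(states: List[State], actions: List[Action],
                        convention: str = "minus_one_zero"):
    """Build (s, a, s', goal, reward) transitions with the final state as goal."""
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")
    goal = states[-1]
    return [
        (states[t], actions[t], states[t + 1], goal,
         sparse_reward(states[t + 1], goal, convention))
        for t in range(len(actions))
    ]


if __name__ == "__main__":
    states = [(0, 0), (1, 0), (1, 1), (2, 1)]
    actions = ["right", "up", "right"]
    for row in relabel_transitions(states, actions):
        print(row)
```

Under either convention only the last transition of the relabeled trajectory is marked as reaching the goal; the two conventions differ only in the constant offset of the reward, which is the difference the summary above refers to.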

Understanding Hindsight Goal Relabeling from a Divergence Minimization Perspective