As reinforcement learning continues to push machine intelligence beyond its conventional boundaries, the lack of effective methods for sparse-reward environments severely limits its application to a broader range of advanced fields. Motivated by the need for an effective deep reinforcement learning algorithm that handles sparse rewards, this paper presents Hindsight Trust Region Policy Optimization (Hindsight TRPO), a method that makes efficient use of interactions under sparse-reward conditions and maintains learning stability by restricting variance during the policy update. First, the hindsight methodology is extended to TRPO, an advanced and efficient on-policy policy gradient method. Then, under the condition that the two distributions are close, the KL divergence is approximated by another $f$-divergence. This approximation reduces the variance of the KL-divergence estimate and alleviates instability during the policy update. Experimental results on both discrete and continuous benchmark tasks demonstrate that Hindsight TRPO converges steadily and significantly faster than previous policy gradient methods. It achieves strong performance and high data efficiency when training policies in sparse-reward environments.
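To make the variance argument concrete, the following is a minimal sketch, not the paper's implementation. It assumes the surrogate is the quadratic form $\frac{1}{2}\,\mathbb{E}_p[(\log p - \log q)^2]$, a commonly used second-order approximation of the KL divergence when $p$ and $q$ are close; the abstract itself does not say which $f$-divergence is used. The two unit-variance Gaussians, the `shift` parameter, and the function names are all hypothetical choices made for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, std=1.0):
    """Log-density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var


def kl_estimators(shift=0.1, n_samples=20000, seed=0):
    """Compare two per-sample estimators of KL(p || q).

    p = N(0, 1) plays the role of the old policy and q = N(shift, 1)
    the new one (both are illustrative assumptions). For these two
    Gaussians the exact KL divergence is shift^2 / 2.
    """
    rng = random.Random(seed)
    naive, quadratic = [], []
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)  # sample from p
        log_ratio = log_normal_pdf(x, 0.0) - log_normal_pdf(x, shift)
        naive.append(log_ratio)                  # standard estimator
        quadratic.append(0.5 * log_ratio ** 2)   # assumed quadratic surrogate
    naive_mean, naive_var = mean_and_variance(naive)
    quad_mean, quad_var = mean_and_variance(quadratic)
    return {
        "true_kl": 0.5 * shift ** 2,
        "naive_mean": naive_mean,
        "naive_var": naive_var,
        "quadratic_mean": quad_mean,
        "quadratic_var": quad_var,
    }


if __name__ == "__main__":
    for name, value in kl_estimators().items():
        print(f"{name}: {value:.6f}")
```

In this toy setting both estimators are centred near the exact value, but the per-sample variance of the quadratic form is far smaller. The quadratic form carries a small bias that shrinks as the two distributions approach each other, which is consistent with the closeness condition stated above.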

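We propose a new reinforcement learning algorithm, Hindsight Trust Region Policy Optimization (HTRPO), which uses hindsight to improve performance under sparse rewards and introduces two techniques, QKL and HGF, to improve learning stability and performance. We evaluate HTRPO on a variety of sparse-reward tasks, including simple benchmarks, image-based Atari games, and simulated robot control. Ablation studies show that QKL and HGF contribute substantially to learning stability and high performance. Comparison results show that HTRPO consistently outperforms TRPO and HPG on all tasks.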

Hindsight Trust Region Policy Optimization