Reinforcement learning (RL) with sparse rewards is a major challenge. We
propose \emph{Hindsight Trust Region Policy Optimization} (HTRPO), a new RL
algorithm that extends the highly successful TRPO algorithm with
\emph{hindsight} to tackle the challenge of sparse rewards. Hindsight refers to
the algorithm's ability to learn from information across goals, including ones
not intended for the current task. HTRPO leverages two main ideas. It
introduces QKL, a quadratic approximation to the KL divergence constraint on
the trust region, leading to reduced variance in KL divergence estimation and
improved stability in policy updates. It also introduces Hindsight Goal
Filtering (HGF) to select conducive hindsight goals. In experiments, we
evaluate HTRPO on various sparse-reward tasks, including simple benchmarks,
image-based Atari games, and simulated robot control. Ablation studies indicate
that QKL and HGF contribute greatly to learning stability and high performance.
Comparison results show that in all tasks, HTRPO consistently outperforms both
TRPO and HPG, a state-of-the-art algorithm for RL with sparse rewards.
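
To make the idea of a quadratic approximation to the KL divergence concrete, the following is a minimal sketch, not the paper's definition of QKL. It assumes a standard second-order surrogate, half the mean squared difference of log-probabilities, and compares it with the usual sample-based KL estimate on a small discrete example. The function names and the distributions `p` and `q` are hypothetical and chosen only for illustration.

```python
import math
import random


def kl_exact(p, q):
    """Exact KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_sample_estimate(p, q, samples):
    """Plain Monte Carlo estimate: mean of log p(x) - log q(x) for x ~ p."""
    return sum(math.log(p[x]) - math.log(q[x]) for x in samples) / len(samples)


def kl_quadratic_estimate(p, q, samples):
    """Assumed quadratic surrogate: mean of 0.5 * (log p(x) - log q(x))**2."""
    return sum(
        0.5 * (math.log(p[x]) - math.log(q[x])) ** 2 for x in samples
    ) / len(samples)


# Hypothetical nearby distributions, standing in for an old and a new policy.
p = [0.5, 0.3, 0.2]
q = [0.48, 0.31, 0.21]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(range(len(p)), weights=p, k=2000)

print("exact KL:          ", kl_exact(p, q))
print("sample estimate:   ", kl_sample_estimate(p, q, samples))
print("quadratic estimate:", kl_quadratic_estimate(p, q, samples))
```

In this sketch every per-sample term of the quadratic estimate is nonnegative, whereas individual terms of the plain estimate can be negative. That gives one intuition for why a quadratic form might yield a steadier estimate when the two policies are close.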

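We propose a new reinforcement learning algorithm, Hindsight Trust Region Policy Optimization, which uses hindsight to improve performance under sparse rewards and introduces two techniques, QKL and HGF, to improve learning stability and performance. We evaluate HTRPO on a variety of sparse-reward tasks, including simple benchmarks, image-based Atari games, and simulated robot control. Ablation studies indicate that QKL and HGF contribute greatly to learning stability and high performance. Comparison results show that HTRPO consistently outperforms TRPO and HPG on all tasks.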

Hindsight Trust Region Policy Optimization

A reinforcement learning agent that needs to pursue different goals across
episodes requires a goal-conditional policy. In addition to their potential to
generalize desirable behavior to unseen goals, such policies may also enable
higher-level planning based on subgoals. In sparse-reward environments, the
capacity to exploit information about the degree to which an arbitrary goal has
been achieved while another goal was intended appears crucial to enable
sample-efficient learning. However, reinforcement learning agents have only recently
been endowed with such capacity for hindsight. In this paper, we demonstrate
how hindsight can be introduced to policy gradient methods, generalizing this
idea to a broad class of successful algorithms. Our experiments on a diverse
selection of sparse-reward environments show that hindsight leads to a
remarkable increase in sample efficiency.
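
The core idea of hindsight, re-evaluating a trajectory against goals other than the intended one, can be sketched as follows. This is a simplified illustration under stated assumptions: states are integers on a hypothetical 1-D line, the reward is 1 only on an exact match with the goal, and the helper names are invented for this example. It shows reward relabeling only and omits any correction a policy gradient method would need for actions that were chosen under a different goal.

```python
def sparse_reward(state, goal):
    """Return 1.0 only when the state matches the goal exactly, else 0.0."""
    return 1.0 if state == goal else 0.0


def relabel_with_hindsight(states, original_goal):
    """Re-evaluate one trajectory's rewards under every goal it visited.

    Returns a dict mapping each candidate goal to the per-step rewards the
    same trajectory would have earned had that goal been the intended one.
    """
    candidate_goals = [original_goal] + sorted(set(states) - {original_goal})
    return {
        goal: [sparse_reward(s, goal) for s in states]
        for goal in candidate_goals
    }


# Hypothetical episode: the agent aimed for state 5 but only reached state 3.
states = [0, 1, 2, 3]
rewards_by_goal = relabel_with_hindsight(states, original_goal=5)

for goal, rewards in rewards_by_goal.items():
    print(f"goal={goal}: rewards={rewards}")
```

Under the intended goal the episode yields no reward at all, while under each visited state it yields exactly one. This is the extra learning signal that hindsight aims to exploit in sparse-reward settings.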

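This paper studies how to introduce hindsight into policy gradient methods. Experiments on a variety of sparse-reward environments show that hindsight significantly improves sample efficiency.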