An agent that has well understood the environment should be able to apply its
skills for any given goals, leading to the fundamental problem of learning the
Universal Value Function Approximator (UVFA). A UVFA learns to predict the
cumulative rewards between all state-goal pairs. However, empirically, the
value function for long-range goals is always hard to estimate and may
consequently result in failed policy. This has presented challenges to the
learning process and the capability of neural networks. We propose a method to
address this issue in large MDPs with sparse rewards, in which exploration and
routing across remote states are both extremely challenging. Our method
explicitly models the environment in a hierarchical manner, with a high-level
dynamic landmark-based map abstracting the visited state space, and a low-level
value network to derive precise local decisions. We use farthest point sampling
to select landmark states from past experience, which has improved exploration
compared with simple uniform sampling. Experimentally we showed that our method
enables the agent to reach long-range goals at the early training stage, and
achieve better performance than standard RL algorithms for a number of
challenging tasks.

本文提出了一种在具有稀疏奖励下的大型 MDPs 中处理 long-range goals 的方法，该方法通过分层建模、farthest point sampling 和 RL 算法的结合来解决这个问题。实验结果表明，该方法比标准的 RL 算法更能有效地达成目标。