Offline Reinforcement Learning (RL) has emerged as a powerful alternative to imitation learning for behavior modeling in various domains, particularly in complex navigation tasks. A persistent challenge in offline RL is the signal-to-noise ratio, i.e., how to mitigate incorrect policy updates caused by errors in value estimates. To address this, multiple works have demonstrated the advantage of hierarchical offline RL methods, which decouple high-level path planning from low-level path following. In this work, we present a novel hierarchical transformer-based approach that leverages a learned quantizer of the state space. This quantization enables the training of a simpler zone-conditioned low-level policy and simplifies planning, which is reduced to discrete autoregressive prediction. Among other benefits, zone-level reasoning in planning enables explicit trajectory stitching rather than implicit stitching based on noisy value function estimates. By combining this transformer-based planner with recent advancements in offline RL, our proposed approach achieves state-of-the-art results in complex long-distance navigation environments.
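
To make the idea of zone-level planning concrete, the following is a minimal sketch of how a quantizer can turn a continuous trajectory into a short sequence of discrete zone tokens, which is the kind of sequence an autoregressive planner would predict. It is an illustration under stated assumptions, not the method of this work: the abstract describes a learned quantizer, whereas the `codebook` below is a hand-picked, hypothetical set of centroids with nearest-centroid assignment, and the function names `quantize` and `to_zone_tokens` are invented for this example.

```python
import numpy as np


def quantize(states: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each state to the index of its nearest centroid (its zone).

    Assumption: nearest-centroid lookup stands in for a learned quantizer.
    """
    # Squared Euclidean distance from every state to every centroid: shape (T, K).
    dists = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def to_zone_tokens(zone_ids: np.ndarray) -> np.ndarray:
    """Collapse consecutive repeats so each zone visit becomes one token."""
    keep = np.ones(len(zone_ids), dtype=bool)
    keep[1:] = zone_ids[1:] != zone_ids[:-1]
    return zone_ids[keep]


# Hypothetical 2D codebook with three zones (hand-picked, not learned).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

# Toy trajectory of 2D positions moving from zone 0 through zone 1 to zone 2.
trajectory = np.array(
    [[0.1, 0.0], [0.2, 0.1], [0.9, 0.1], [1.0, 0.4], [1.0, 0.9]]
)

zone_ids = quantize(trajectory, codebook)  # per-step zone index
tokens = to_zone_tokens(zone_ids)          # compressed zone sequence
print(zone_ids.tolist(), tokens.tolist())
```

In this toy setting, two trajectories that pass through a common zone share a token, so a planner operating on token sequences can join them at that zone explicitly instead of relying on value estimates.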

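This paper addresses the signal-to-noise ratio problem in offline reinforcement learning caused by errors in value estimates. It proposes a transformer-based hierarchical method that learns a quantization of the space, which simplifies both low-level policy training and planning and significantly improves performance in complex long-distance navigation environments. The method demonstrates explicit trajectory stitching, with important implications for improving offline reinforcement learning.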

Navigation with QPHIL: Quantizing Planner for Hierarchical Implicit Q-Learning