Goal-oriented reinforcement learning, where the agent needs to reach the goal
state while simultaneously minimizing the cost, has received significant
attention in real-world applications. Its theoretical formulation, stochastic
shortest path (SSP), has been intensively researched in the online setting.
Nevertheless, it remains understudied when such an online interaction is
prohibited and only historical data is provided. In this paper, we consider the
offline stochastic shortest path problem when the state space and the action
space are finite. We design simple value iteration-based algorithms for
tackling both offline policy evaluation (OPE) and offline policy learning
tasks. Notably, our analysis of these simple algorithms yields strong
instance-dependent bounds that imply near-minimax optimal worst-case
bounds. We hope our study helps illuminate the fundamental
statistical limits of the offline SSP problem and motivate further studies
beyond the scope of current consideration.
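
As a rough illustration of the setting, the sketch below runs plain value iteration on an empirical SSP model estimated from logged transitions. It is not the algorithm analyzed in the paper: it omits any pessimism or confidence correction, and the names `estimate_model`, `q_value`, and `value_iteration`, the tuple layout of the data, and the toy chain are all assumptions made for this example.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Empirical mean cost and next-state distribution per (state, action).

    `transitions` is a list of (state, action, cost, next_state) tuples
    (a hypothetical layout for the offline dataset).
    """
    counts = defaultdict(lambda: defaultdict(int))
    cost_sum = defaultdict(float)
    for s, a, c, s_next in transitions:
        counts[(s, a)][s_next] += 1
        cost_sum[(s, a)] += c
    model = defaultdict(dict)  # state -> action -> (mean cost, {next_state: prob})
    for (s, a), nexts in counts.items():
        n = sum(nexts.values())
        model[s][a] = (cost_sum[(s, a)] / n, {s2: k / n for s2, k in nexts.items()})
    return model


def q_value(cost, probs, V):
    # States never seen as a source in the data default to value 0
    # (a simplification; a careful offline method would treat them differently).
    return cost + sum(p * V.get(s2, 0.0) for s2, p in probs.items())


def value_iteration(model, goal, tol=1e-8, max_iters=10_000):
    """Minimize expected cost-to-go on the empirical model; the goal has value 0."""
    V = {s: 0.0 for s in model}
    V[goal] = 0.0
    for _ in range(max_iters):
        delta = 0.0
        for s, actions in model.items():
            if s == goal:
                continue
            best = min(q_value(c, p, V) for c, p in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {
        s: min(actions, key=lambda a: q_value(*actions[a], V))
        for s, actions in model.items()
        if s != goal
    }
    return V, policy


# Toy chain 0 -> 1 -> 2 (goal), with a costly shortcut from state 0.
data = [
    (0, "right", 1.0, 1),
    (1, "right", 1.0, 2),
    (0, "stay", 1.0, 0),
    (1, "left", 1.0, 0),
    (0, "jump", 3.0, 2),
]
V, policy = value_iteration(estimate_model(data), goal=2)
```

Only state-action pairs that appear in the data are considered, which is one reason coverage of the dataset matters in the offline setting.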

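This paper studies goal-oriented reinforcement learning in the offline setting with finite state and action spaces, proposes simple value iteration-based algorithms for offline policy evaluation and offline policy learning, and analyzes the strong instance-dependent bounds of these algorithms.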

Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality

In goal-oriented reinforcement learning, relabeling the raw goals in past
experience to provide agents with hindsight ability is a major solution to the
reward sparsity problem. In this paper, to enhance the diversity of relabeled
goals, we develop FGI (Foresight Goal Inference), a new relabeling strategy
that relabels the goals by looking into the future with a learned dynamics
model. Besides, to improve sample efficiency, we propose to use the dynamics
model to generate simulated trajectories for policy training. By integrating
these two improvements, we introduce the MapGo framework (Model-Assisted Policy
Optimization for Goal-oriented tasks). In our experiments, we first show the
effectiveness of the FGI strategy compared with the hindsight one, and then
show that the MapGo framework achieves higher sample efficiency when compared
to model-free baselines on a set of complicated tasks.
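
The idea of relabeling a goal by looking into the future with a learned dynamics model can be sketched as follows. This is only one plausible reading of the abstract, not the authors' FGI implementation: the function `foresight_relabel`, its arguments (`policy`, `dynamics`, `state_to_goal`, `horizon`), and the one-dimensional toy environment are all hypothetical.

```python
def foresight_relabel(state, goal, policy, dynamics, state_to_goal, horizon):
    """Roll a (learned) dynamics model forward and return the goal reached.

    policy(state, goal) -> action        (hypothetical interface)
    dynamics(state, action) -> next state predicted by the model
    state_to_goal(state) -> goal representation of a state
    """
    s = state
    for _ in range(horizon):
        a = policy(s, goal)
        s = dynamics(s, a)
    # The imagined future state supplies the relabeled goal.
    return state_to_goal(s)


# Toy 1-D example: states and goals are integers, actions are -1, 0, or +1.
def toy_policy(s, g):
    return (g > s) - (g < s)  # step toward the goal, stop when reached


def toy_dynamics(s, a):
    return s + a  # stands in for a learned model


new_goal = foresight_relabel(0, 5, toy_policy, toy_dynamics, lambda s: s, horizon=3)
```

In this toy case the relabeled goal is the state the model predicts after three steps, which need not appear anywhere in the recorded trajectory.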

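This paper proposes FGI, a new relabeling strategy for mitigating the reward sparsity problem, and uses a dynamics model to generate simulated trajectories that improve sample efficiency. It combines the two in MapGo, a model-assisted policy optimization framework for goal-oriented tasks. Experiments on complicated tasks show that the FGI strategy is effective compared with the hindsight strategy and that MapGo achieves higher sample efficiency than model-free baselines.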

MapGo: Model-Assisted Policy Optimization for Goal-Oriented Tasks

Goal-oriented reinforcement learning has recently been a practical framework
for robotic manipulation tasks, in which an agent is required to reach a
certain goal defined by a function on the state space. However, the sparsity of
such reward definition makes traditional reinforcement learning algorithms very
inefficient. Hindsight Experience Replay (HER), a recent advance, has greatly
improved sample efficiency and practical applicability for such problems. It
exploits previous replays by constructing imaginary goals in a simple heuristic
way, acting like an implicit curriculum to alleviate the challenge of sparse
reward signal. In this paper, we introduce Hindsight Goal Generation (HGG), a
novel algorithmic framework that generates valuable hindsight goals which are
easy for an agent to achieve in the short term and also have the potential to
guide the agent to reach the actual goal in the long term. We have
extensively evaluated our goal generation algorithm on a number of robotic
manipulation tasks and demonstrated substantial improvement over the original
HER in terms of sample efficiency.
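
For context, the hindsight relabeling that HER performs can be sketched as below, using the commonly described "future" strategy and a sparse reward of 0 on success and -1 otherwise. This illustrates the HER baseline only, not HGG's goal generation; the episode layout and the names `sparse_reward` and `her_relabel` are assumptions made for this example.

```python
import random


def sparse_reward(achieved, goal):
    # 0 when the goal is achieved, -1 otherwise.
    return 0.0 if achieved == goal else -1.0


def her_relabel(episode, k=4, rng=None):
    """Relabel each transition with k goals achieved later in the same episode.

    `episode` is a list of (state, action, achieved_goal) tuples, where
    achieved_goal is what the agent reached after taking the action
    (a hypothetical layout).
    """
    rng = rng or random.Random(0)
    relabeled = []
    for t, (state, action, achieved) in enumerate(episode):
        for _ in range(k):
            # Pick a time step from t to the end of the episode.
            future = rng.randrange(t, len(episode))
            new_goal = episode[future][2]
            reward = sparse_reward(achieved, new_goal)
            relabeled.append((state, action, new_goal, reward))
    return relabeled


# Toy episode: the agent walks 0 -> 1 -> 2 -> 3 without reaching its original goal.
episode = [(0, 1, 1), (1, 1, 2), (2, 1, 3)]
extra = her_relabel(episode, k=2, rng=random.Random(0))
```

Each relabeled transition is paired with a goal the agent really did reach, so at least some stored transitions carry a non-negative reward even when the original goal was never achieved.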

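This paper introduces Hindsight Goal Generation, a new algorithmic framework for goal-oriented reinforcement learning. The framework generates hindsight goals that the agent can achieve in the short term and that guide it toward the actual goal in the long term, substantially improving sample efficiency and easing the reward sparsity problem. Experiments on several robotic manipulation tasks demonstrate the effectiveness and advantages of the algorithm.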