Goal-oriented Reinforcement Learning, where the agent needs to reach the goal
state while simultaneously minimizing the cost, has received significant
attention in real-world applications. Its theoretical formulation, stochastic
shortest path (SSP), has been intensively researched in the online setting.
Nevertheless, it remains understudied when online interaction is
prohibited and only historical data is provided. In this paper, we consider the
offline stochastic shortest path problem when the state space and the action
space are finite. We design simple value iteration-based algorithms for
tackling both offline policy evaluation (OPE) and offline policy learning
tasks. Notably, our analysis of these simple algorithms yields strong
instance-dependent bounds which can imply worst-case bounds that are
near-minimax optimal. We hope our study helps illuminate the fundamental
statistical limits of the offline SSP problem and motivate further studies
beyond the scope of current consideration.
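
To make the offline policy evaluation setting concrete, the following is a minimal sketch, not the paper's algorithm: it fits an empirical SSP model from logged transitions and runs plain value iteration for a fixed policy. The function names `estimate_model` and `evaluate_policy`, the `(state, action, cost, next_state)` tuple layout, and the choice to leave unseen state-action pairs at value 0 are all assumptions made for illustration, and the sketch omits any correction the authors' analysis may rely on.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Build empirical transition probabilities and mean costs.

    `transitions` is an iterable of (state, action, cost, next_state)
    tuples; this layout is an assumption of the sketch.
    """
    counts = defaultdict(lambda: defaultdict(int))
    cost_sum = defaultdict(float)
    for s, a, c, s_next in transitions:
        counts[(s, a)][s_next] += 1
        cost_sum[(s, a)] += c

    P, C = {}, {}
    for sa, nexts in counts.items():
        n = sum(nexts.values())
        P[sa] = {s2: k / n for s2, k in nexts.items()}
        C[sa] = cost_sum[sa] / n
    return P, C


def evaluate_policy(P, C, policy, goal, states, tol=1e-10, max_iter=10_000):
    """Estimate the expected cost-to-goal of `policy` by value iteration."""
    V = {s: 0.0 for s in states}  # the goal keeps value 0 throughout
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            if s == goal:
                continue
            sa = (s, policy[s])
            if sa not in P:
                # Assumption: pairs never seen in the data are left at 0.
                continue
            new_v = C[sa] + sum(p * V[s2] for s2, p in P[sa].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    return V


# Toy data: from state 0, action "go" costs 1 and reaches the goal "g"
# in half of the logged transitions, otherwise it stays in state 0.
data = [(0, "go", 1.0, "g"), (0, "go", 1.0, 0)]
P, C = estimate_model(data)
V = evaluate_policy(P, C, policy={0: "go"}, goal="g", states=[0, "g"])
```

In the toy data the estimated value of state 0 solves V(0) = 1 + 0.5 V(0), so the iteration converges to 2.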

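This paper studies goal-oriented reinforcement learning in the offline setting with finite state and action spaces, proposes simple value iteration-based algorithms for offline policy evaluation and policy learning, and derives strong instance-dependent bounds for these algorithms.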

Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality

Stochastic Shortest Path (SSP) MDPs are a problem class widely studied in AI,
especially in probabilistic planning. They describe a wide range of scenarios
but make the restrictive assumption that the goal is reachable from any state,
i.e., that dead-end states do not exist. Because of this, SSPs are unable to
model various scenarios that may have catastrophic events (e.g., an airplane
possibly crashing if it flies into a storm). Even though MDP algorithms have
been used for solving problems with dead ends, a principled theory of SSP
extensions that would allow dead ends, including theoretically sound algorithms
for solving such MDPs, has been lacking. In this paper, we propose three new
MDP classes that admit dead ends under increasingly weaker assumptions. We
present Value Iteration-based algorithms, as well as more efficient heuristic
search algorithms, for optimally solving each class, and explore theoretical
relationships between these classes. We also conduct a preliminary empirical
study comparing the performance of our algorithms on different MDP classes,
especially on scenarios with unavoidable dead ends.
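
The notion of a dead end used above (a state from which the goal cannot be reached) can be illustrated with a small reachability check. The sketch below is not one of the paper's algorithms: it only marks states that have no positive-probability path to the goal under any choice of actions. The function name `find_dead_ends`, the nested-dictionary MDP layout, and the toy airplane example are assumptions made for illustration.

```python
from collections import deque


def find_dead_ends(transitions, goal):
    """Return the states from which `goal` is unreachable.

    `transitions` maps state -> action -> {next_state: probability};
    this layout is an assumption of the sketch.
    """
    # Reverse graph: s is a predecessor of s2 if some action taken in s
    # leads to s2 with positive probability.
    predecessors = {s: set() for s in transitions}
    predecessors.setdefault(goal, set())
    for s, actions in transitions.items():
        for outcomes in actions.values():
            for s2, p in outcomes.items():
                if p > 0:
                    predecessors.setdefault(s2, set()).add(s)

    # Breadth-first search backwards from the goal.
    can_reach = {goal}
    queue = deque([goal])
    while queue:
        t = queue.popleft()
        for s in predecessors[t]:
            if s not in can_reach:
                can_reach.add(s)
                queue.append(s)

    return set(predecessors) - can_reach


# Toy version of the storm example: flying into the storm may end in a
# crash, and no action leads out of the crashed state.
mdp = {
    "airport": {
        "fly_direct": {"storm": 0.3, "destination": 0.7},
        "fly_around": {"destination": 1.0},
    },
    "storm": {"continue": {"crashed": 0.5, "destination": 0.5}},
    "crashed": {"wait": {"crashed": 1.0}},
}
dead_ends = find_dead_ends(mdp, goal="destination")
```

Here only `"crashed"` is reported, because `"storm"` still reaches the destination with positive probability.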

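This paper proposes three new MDP classes that admit dead ends (states from which the goal cannot be reached), presents theoretically grounded algorithms for solving them, explores the theoretical relationships between these classes, and conducts a preliminary empirical study.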