We study the sample complexity of learning an $\epsilon$-optimal policy in
the Stochastic Shortest Path (SSP) problem. We first derive sample complexity
bounds when the learner has access to a generative model. We show that there
exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost
$c_{\min}$, and maximum expected cost of the optimal policy over all states
$B_{\star}$, where any algorithm requires at least
$\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an
$\epsilon$-optimal policy with high probability. Surprisingly, this implies
that whenever $c_{\min}=0$ an SSP problem may not be learnable, thus revealing
that learning in SSPs is strictly harder than in the finite-horizon and
discounted settings. We complement this result with lower bounds when prior
knowledge of the hitting time of the optimal policy is available and when we
restrict optimality by competing against policies with bounded hitting time.
Finally, we design an algorithm with matching upper bounds in these cases. This
settles the sample complexity of learning $\epsilon$-optimal policies in SSP
with generative models.
We also initiate the study of learning $\epsilon$-optimal policies without
access to a generative model (i.e., the so-called best-policy identification
problem), and show that sample-efficient learning is impossible in general. On
the other hand, efficient learning can be made possible if we assume the agent
can directly reach the goal state from any state by paying a fixed cost. We
then establish the first upper and lower bounds under this assumption.
Finally, using similar analytic tools, we prove that horizon-free regret is
impossible in SSPs under general costs, resolving an open problem in
(Tarbouriech et al., 2021c).
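
To make the dependence on $c_{\min}$ concrete, the following minimal sketch evaluates the scaling term $SAB_{\star}^3/(c_{\min}\epsilon^2)$ from the lower bound stated above. The function name `ssp_lower_bound_scale` and the example values are illustrative assumptions. The constant hidden by $\Omega(\cdot)$ is not given in the abstract, so the output is a scaling, not an actual sample count.

```python
import math

def ssp_lower_bound_scale(S, A, B_star, c_min, eps):
    """Scaling S*A*B_star^3 / (c_min * eps^2) of the lower bound.

    Constants hidden by the Omega notation are omitted (assumption:
    only the scaling term from the abstract is evaluated).
    """
    if c_min < 0 or eps <= 0:
        raise ValueError("c_min must be non-negative and eps positive")
    if c_min == 0:
        # The term diverges: no finite sample size suffices in the worst case.
        return math.inf
    return S * A * B_star**3 / (c_min * eps**2)

# Hypothetical instance sizes, chosen only for illustration.
for c_min in (1.0, 0.1, 0.01, 0.0):
    scale = ssp_lower_bound_scale(S=10, A=4, B_star=5.0, c_min=c_min, eps=0.1)
    print(f"c_min={c_min}: {scale}")
```

The scaling grows without limit as `c_min` shrinks, which is the mechanism behind the statement that an SSP problem may not be learnable when $c_{\min}=0$.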

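This paper studies the sample complexity of learning a near-optimal policy in the stochastic shortest path problem for Markov decision processes. It derives sample complexity lower bounds when a generative model is available and proposes an algorithm that matches them. It also examines whether efficient learning is possible for best-policy identification without a generative model, and proves that it is possible under certain assumptions.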

Reaching the Goal Is Hard: Resolving the Sample Complexity of the Stochastic Shortest Path Problem

Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path

We investigate the problem of best-policy identification in discounted Markov
Decision Processes (MDPs) when the learner has access to a generative model.
The objective is to devise a learning algorithm returning the best policy as
early as possible. We first derive a problem-specific lower bound of the sample
complexity satisfied by any learning algorithm. This lower bound corresponds to
an optimal sample allocation that solves a non-convex program, and hence, is
hard to exploit in the design of efficient algorithms. We then provide a simple
and tight upper bound of the sample complexity lower bound, whose corresponding
nearly-optimal sample allocation becomes explicit. The upper bound depends on
specific functionals of the MDP such as the sub-optimality gaps and the
variance of the next-state value function, and thus really captures the
hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an
algorithm tracking this nearly-optimal allocation, and provide asymptotic
guarantees for its sample complexity (both almost surely and in expectation).
The advantages of KLB-TS against state-of-the-art algorithms are discussed and
illustrated numerically.
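
The two quantities named above, the sub-optimality gaps and the variance of the next-state value function, can be computed exactly on a small known MDP. The sketch below does this with plain value iteration. The toy MDP, the function names, and the iteration count are illustrative assumptions. How the paper combines these quantities into its bound is not stated in the abstract, and this is not KLB-TS itself.

```python
def q_value_iteration(P, R, gamma, iters=1000):
    """Compute Q* for a discounted MDP by value iteration.

    P[s][a] is a list of next-state probabilities, R[s][a] a reward.
    """
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q

def gaps_and_variances(P, R, gamma):
    """Return the gaps V*(s) - Q*(s, a) and Var_{s' ~ P(.|s,a)}[V*(s')]."""
    Q = q_value_iteration(P, R, gamma)
    V = [max(row) for row in Q]
    gaps = [[V[s] - q for q in Q[s]] for s in range(len(Q))]
    variances = []
    for s in range(len(P)):
        row = []
        for probs in P[s]:
            mean = sum(p * v for p, v in zip(probs, V))
            row.append(sum(p * (v - mean) ** 2 for p, v in zip(probs, V)))
        variances.append(row)
    return gaps, variances

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = [[[1.0, 0.0], [0.5, 0.5]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
gamma = 0.9
gaps, variances = gaps_and_variances(P, R, gamma)
print("gaps:", gaps)
print("next-state value variances:", variances)
```

Intuitively, small gaps and large variances both make it harder to tell the best action apart from the others using a limited number of samples.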

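This paper studies the identification of the optimal policy in Markov decision processes using a generative model, proposes the KLB-TS algorithm, and provides asymptotic guarantees on its sample complexity.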

Adaptive sampling for best-policy identification in Markov decision processes

Adaptive Sampling for Best Policy Identification in Markov Decision Processes

Reward-free exploration is a reinforcement learning setting studied by Jin et
al. (2020), who address it by running several algorithms with regret guarantees
in parallel. In our work, we instead give a more natural adaptive approach for
reward-free exploration which directly reduces upper bounds on the maximum MDP
estimation error. We show that, interestingly, our reward-free UCRL algorithm (RF-UCRL)
can be seen as a variant of an algorithm of Fiechter from 1994, originally
proposed for a different objective that we call best-policy identification. We
prove that RF-UCRL needs on the order of $({SAH^4}/{\varepsilon^2})(\log(1/\delta) +
S)$ episodes to output, with probability $1-\delta$, an
$\varepsilon$-approximation of the optimal policy for any reward function. This
bound improves over existing sample-complexity bounds in both the small
$\varepsilon$ and the small $\delta$ regimes. We further investigate the
relative complexities of reward-free exploration and best-policy
identification.
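
As a rough illustration of the two regimes mentioned above, the following sketch evaluates the stated rate $({SAH^4}/{\varepsilon^2})(\log(1/\delta) + S)$. The function name `rf_ucrl_episode_scale` and the example values are illustrative assumptions. Constants and any factors hidden by "on the order of" are omitted, so the result is a scaling rather than an episode count.

```python
import math

def rf_ucrl_episode_scale(S, A, H, eps, delta):
    """Evaluate (S*A*H^4 / eps^2) * (log(1/delta) + S).

    Assumption: only the leading-order expression from the abstract
    is computed; hidden constants are not included.
    """
    if not (0 < delta < 1) or eps <= 0:
        raise ValueError("need 0 < delta < 1 and eps > 0")
    return (S * A * H**4 / eps**2) * (math.log(1 / delta) + S)

# Hypothetical problem sizes, chosen only for illustration.
S, A, H, eps = 20, 5, 10, 0.1
for delta in (1e-1, 1e-3, 1e-12):
    print(delta, rf_ucrl_episode_scale(S, A, H, eps, delta))
```

When $\log(1/\delta)$ is small relative to $S$, the additive $S$ term dominates the second factor. For very small $\delta$, the $\log(1/\delta)$ term takes over.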

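We propose a new adaptive approach to reward-free exploration that directly reduces upper bounds on the maximum MDP estimation error, and we prove a favorable sample complexity bound for the RF-UCRL algorithm. RF-UCRL can be seen as a variant of Fiechter's algorithm, which was originally designed for a different objective: best-policy identification.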