This paper gives the first polynomial-time algorithm for tabular Markov
Decision Processes (MDPs) that enjoys a regret bound \emph{independent of the
planning horizon}. Specifically, we consider a tabular MDP with $S$ states, $A$
actions, a planning horizon $H$, and total reward bounded by $1$, in which the
agent plays for $K$ episodes. We design an algorithm that achieves
$O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to
existing bounds, which either have an additional $\mathrm{polylog}(H)$
dependency~\citep{zhang2020reinforcement} or have an exponential dependency on
$S$~\citep{li2021settling}. Our result relies on a sequence of new structural
lemmas establishing the approximation power, stability, and concentration
properties of stationary policies, which may have applications to other problems
related to Markov chains.
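
For concreteness, regret in this episodic setting is usually taken to be the cumulative suboptimality of the policies played. The display below is a sketch under assumed notation that the abstract does not fix: $\pi_k$ denotes the policy used in episode $k$, $s_1^k$ its initial state, $V^{\pi}$ the expected total reward of $\pi$, and $V^{\star}$ the optimal value.
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \left( V^{\star}(s_1^k) - V^{\pi_k}(s_1^k) \right)
\;\le\; O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right).
\]
Because the total reward per episode is bounded by $1$, each summand lies in $[0,1]$, so the trivial bound is $K$; the stated bound improves on it once $\sqrt{K}$ exceeds the polynomial factor, with no dependence on $H$.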

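This paper presents the first polynomial-time algorithm for tabular MDPs whose regret bound is independent of the planning horizon. The result rests on a series of new structural lemmas that establish the approximation power, stability, and concentration properties of stationary policies.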

Horizon-Free Reinforcement Learning in Polynomial Time: The Power of Stationary Policies

We propose an algorithm that uses linear function approximation (LFA) for
stochastic shortest path (SSP). Under minimal assumptions, it obtains sublinear
regret, is computationally efficient, and uses stationary policies. To our
knowledge, this is the first such algorithm in the LFA literature (for SSP or
other formulations). Our algorithm is a special case of a more general one,
which, given access to a certain computation oracle, achieves regret that
scales as the square root of the number of episodes.

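This work proposes an algorithm for the stochastic shortest path problem that uses linear function approximation. It achieves sublinear regret, is computationally efficient, and uses stationary policies, and it is the first such algorithm in this literature.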