We present an algorithm based on the \emph{Optimism in the Face of
Uncertainty} (OFU) principle that learns efficiently in reinforcement learning
(RL) problems modeled as Markov decision processes (MDPs) with finite
state-action spaces. By evaluating the state-pair difference of the optimal bias
function $h^{*}$, the proposed algorithm achieves a regret bound of
$\tilde{O}(\sqrt{SAHT})$\footnote{The symbol $\tilde{O}$ denotes $O$ with
logarithmic factors ignored.} for MDPs with $S$ states and $A$ actions, provided
that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$, is known. This result
improves on the best previous regret bound of $\tilde{O}(S\sqrt{AHT})$
\citep{fruit2019improved} by a factor of $\sqrt{S}$. Furthermore, this regret
bound matches the lower bound of $\Omega(\sqrt{SAHT})$ \citep{jaksch2010near}
up to a logarithmic factor. As a consequence, we obtain a regret bound of
$\tilde{O}(\sqrt{SADT})$ for MDPs with finite diameter $D$, which is
near-optimal with respect to the lower bound of $\Omega(\sqrt{SADT})$
\citep{jaksch2010near}.
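
To make the quantities above concrete, the following sketch assumes the
standard definition of the span and rewards bounded in $[0,1]$; neither
assumption is stated explicitly in the text above. The span of the optimal bias
function is
\[
  sp(h^{*}) = \max_{s} h^{*}(s) - \min_{s} h^{*}(s),
\]
and the improvement over the previous bound is the ratio
\[
  \frac{S\sqrt{AHT}}{\sqrt{SAHT}} = \sqrt{S}.
\]
Under these assumptions, $sp(h^{*}) \le D$ for an MDP with finite diameter $D$,
so taking $H = D$ in $\tilde{O}(\sqrt{SAHT})$ gives the diameter-based bound
$\tilde{O}(\sqrt{SADT})$.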
