We present an algorithm based on the Optimism in the Face of Uncertainty (OFU) principle that efficiently learns in reinforcement learning (RL) problems modeled as Markov decision processes (MDPs) with finite state-action spaces. By evaluating the state-pair differences of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$ for an MDP with $S$ states and $A$ actions, provided that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$, is known. This result improves on the best previous regret bound of $\tilde{O}(HS\sqrt{AT})$ [Bartlett and Tewari, 2009] by a factor of $\sqrt{SH}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SAHT})$ [Jaksch et al., 2010] up to logarithmic factors. As a consequence, we show that there is a near-optimal regret bound of $\tilde{O}(\sqrt{SADT})$ for MDPs with finite diameter $D$, in comparison with the lower bound of $\Omega(\sqrt{SADT})$ [Jaksch et al., 2010].
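
The two comparisons in the abstract can be checked with a short calculation. The first line divides the earlier bound by the new one. The second line is our reading of "as a consequence": it assumes the standard fact that $sp(h^{*}) \le D$ for an MDP with diameter $D$, so that $H = D$ is a valid upper bound on the span. The abstract itself does not spell out this step.

\[
\frac{HS\sqrt{AT}}{\sqrt{SAHT}} = \frac{HS}{\sqrt{SH}} = \sqrt{SH},
\]
\[
sp(h^{*}) \le D \;\Longrightarrow\; \tilde{O}\big(\sqrt{SAHT}\big)\Big|_{H = D} = \tilde{O}\big(\sqrt{SADT}\big).
\]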

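An algorithm based on the principle of optimism in the face of uncertainty efficiently learns reinforcement learning (RL) problems modeled as Markov decision processes (MDPs) with finite state-action spaces. By evaluating the state-pair differences of the optimal bias function $h^{*}$, the algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$ for MDPs when $sp(h^{*})$ is known. This result improves on the previous best regret bound of $\tilde{O}(S\sqrt{AHT})$ and matches the regret lower bound. In addition, for MDPs with finite diameter $D$, we prove that $\tilde{O}(\sqrt{SADT})$ is a near-optimal regret upper bound.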

Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function