In this paper, we consider the problem of online learning in Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman rank, we develop an online learning algorithm that learns the optimal value function while achieving low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by OLIVE, the policy elimination algorithm proposed by Jiang et al. (2017). One of our key technical contributions in AVE is to formulate the elimination steps in OLIVE as contextual bandit problems. This technique enables us to apply the active elimination and expert weighting methods of Dudik et al. (2011), instead of the random action exploration scheme used in the original OLIVE algorithm, for more efficient exploration and better control of the regret incurred in each policy elimination step. To the best of our knowledge, this is the first $\sqrt{n}$-regret result for reinforcement learning in stochastic MDPs with general value function approximation.
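
For readers unfamiliar with the term, a $\sqrt{n}$-regret guarantee can be read against the usual definition of cumulative regret in episodic reinforcement learning. The notation below ($n$, $\pi_t$, $V^{\pi}$, $V^{*}$, $C$) is assumed for illustration and is not taken from the paper; the exact constants and any logarithmic factors in the paper's bound are not stated in the abstract.

```latex
% Assumed notation (illustrative only):
%   n        number of episodes
%   \pi_t    policy executed in episode t
%   V^{\pi}  expected return of policy \pi from the initial state distribution
%   V^{*}    expected return of an optimal policy
\[
  \mathrm{Regret}(n) \;=\; \sum_{t=1}^{n} \bigl( V^{*} - V^{\pi_t} \bigr)
\]
% A \sqrt{n}-regret bound, with C independent of n (possibly up to log factors):
\[
  \mathrm{Regret}(n) \;\le\; C \sqrt{n}
  \quad\Longrightarrow\quad
  \frac{\mathrm{Regret}(n)}{n} \;\le\; \frac{C}{\sqrt{n}} \;\longrightarrow\; 0
  \quad \text{as } n \to \infty .
\]
```

Under this reading, sublinear regret means the average per-episode loss relative to the optimal policy vanishes as the number of episodes grows, at a rate of order $1/\sqrt{n}$.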

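This paper proposes an online learning algorithm, Adaptive Value-function Elimination (AVE), for Markov decision processes (MDPs) with large state spaces. It formulates the elimination steps in OLIVE as contextual bandit problems, which allows it to learn the optimal value function while incurring very low cumulative regret during learning. This is the first $\sqrt{n}$ cumulative regret result for reinforcement learning in stochastic MDPs with general value function approximation.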

$\sqrt{n}$-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank