Modern tasks in reinforcement learning are always with large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represents states and actions in a low dimensional space. In this paper, we study reinforcement learning with feature mapping for discounted Markov Decision Processes (MDPs). We propose a novel algorithm which makes use of the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing to a generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms, the regret is lower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.

本论文介绍了一种基于特性映射的新算法，能够以线性的方式参数化转移核函数来处理强化学习中的大状态和行动空间，并且证明了该算法在一些强化学习的问题中，不需要访问生成模型就能取得多项式的最优后悔值，且总体上是近乎最优的。

具有特征映射的折扣 MDP 的可证明高效强化学习