We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample complexity of the model-based value iteration algorithm in the presence of a generative model. The bound indicates that, for an MDP with N state-action pairs and discount factor \gamma\in[0,1), only O(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) samples are required to find an \epsilon-optimal estimate of the action-value function with probability 1-\delta. We also prove a matching lower bound of \Theta(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) on the sample complexity of estimating the optimal action-value function by every RL algorithm. To the best of our knowledge, this is the first result on the sample complexity of estimating the optimal (action-)value function in which the upper bound matches the lower bound of RL in terms of N, \epsilon, \delta and 1/(1-\gamma). Both our lower bound and our upper bound also significantly improve on the state of the art in terms of 1/(1-\gamma).
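The setting described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a hypothetical two-state, two-action MDP (TRUE_P, REWARD and GAMMA are invented for illustration), draws the same number of next-state samples for every state-action pair from a generative model, builds the empirical transition model, and runs Q-value iteration on it. The helper sample_scale evaluates only the scaling N\log(N/\delta)/((1-\gamma)^3\epsilon^2) with constants omitted, so it is not a usable sample count.

```python
import math
import random

# Hypothetical toy MDP (not from the paper): 2 states, 2 actions, rewards in [0, 1].
# TRUE_P[s][a][s_next] is the true transition probability.
TRUE_P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
REWARD = [[0.0, 0.5], [1.0, 0.2]]
GAMMA = 0.5


def generative_model(s, a, rng):
    """Return one next state drawn from the true transition distribution."""
    u = rng.random()
    cumulative = 0.0
    for s_next, prob in enumerate(TRUE_P[s][a]):
        cumulative += prob
        if u < cumulative:
            return s_next
    return len(TRUE_P[s][a]) - 1  # guard against floating-point round-off


def estimate_model(model, n_states, n_actions, n_samples, rng):
    """Empirical transition model from n_samples calls per state-action pair."""
    p_hat = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            counts = [0] * n_states
            for _ in range(n_samples):
                counts[model(s, a, rng)] += 1
            row.append([c / n_samples for c in counts])
        p_hat.append(row)
    return p_hat


def q_value_iteration(p, r, gamma, n_iters):
    """Apply the Bellman optimality operator n_iters times, starting from Q = 0."""
    n_states, n_actions = len(p), len(p[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        v = [max(q_s) for q_s in q]  # greedy state values
        q = [
            [
                r[s][a] + gamma * sum(p[s][a][t] * v[t] for t in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return q


def sample_scale(n_pairs, gamma, eps, delta):
    """Scaling of the bound only; the constants of the theorem are omitted."""
    return n_pairs * math.log(n_pairs / delta) / ((1.0 - gamma) ** 3 * eps ** 2)


rng = random.Random(0)
p_hat = estimate_model(generative_model, 2, 2, 5000, rng)
q_hat = q_value_iteration(p_hat, REWARD, GAMMA, 60)   # model-based estimate
q_true = q_value_iteration(TRUE_P, REWARD, GAMMA, 60)  # reference on the true model
max_err = max(abs(q_hat[s][a] - q_true[s][a]) for s in range(2) for a in range(2))
print("max |Q_hat - Q*| =", max_err)
```

In this sketch the estimation error comes entirely from the empirical transition model, since both runs use the same rewards and the same number of iterations. Halving \epsilon in sample_scale quadruples the returned value, which reflects the 1/\epsilon^2 dependence of the bound.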

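Using a generative model, this paper proves that the sample complexity of the model-based value iteration algorithm in Markov decision processes has a PAC upper bound of O(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)), where N is the number of state-action pairs, \gamma is the discount factor, \epsilon is the accuracy of the \epsilon-optimal estimate of the action-value function, and 1-\delta is the probability with which that accuracy is attained. It also proves a lower bound of \Theta(N\log(N/\delta)/((1-\gamma)^3\epsilon^2)) on the sample complexity of estimating the optimal action-value function by any reinforcement learning algorithm. The upper and lower bounds match in terms of N, \epsilon, \delta and 1/(1-\gamma).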

A Study of Sample Complexity in Reinforcement Learning with a Generative Model