We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) with finite state and action spaces. We assume that the agent has access to a generative model and that the MDP possesses a unique optimal policy. In this setting, we derive a problem-specific lower bound of the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence, is hard to exploit in the design of efficient algorithms. We provide a simple and tight upper bound of the sample complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP such as the sub-optimal gaps and the variance of the next-state value function, and thus really summarizes the hardness of the MDP. We devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS against state-of-the-art algorithms are finally discussed.

本文研究在马尔可夫决策过程中，通过生成模型来识别最优策略，提出了 KLB-TS 算法，并提供了其样本复杂度的渐近保证。

马尔可夫决策过程中最佳策略识别的自适应采样