PAC Upper Bounds for Discounted Markov Decision Processes

Feb, 2012

PAC Bounds for Discounted MDPs

Tor Lattimore, Marcus Hutter

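TL;DR: This paper studies upper and lower bounds on the sample complexity of learning near-optimal behaviour in finite-state discounted Markov decision processes. Assuming that each action leads to at most two possible next states, it proves a new bound for the UCRL algorithm. It also strengthens previous work with a lower bound that is both more general and tighter. The upper and lower bounds match up to logarithmic factors.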

Abstract

We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov decision processes