TL;DR: This paper studies the number of iterations that two policy iteration algorithms need in order to converge on Markov decision processes, and uses structural properties to derive upper bounds that are independent of the discount factor; under the assumption that the state space splits into transient and recurrent states, both Howard's PI and Simplex-PI terminate in strongly polynomial time.
Abstract
Given a Markov decision process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge. We consider two variations of PI: Howard's PI, which changes all the actions with a positive advantage, and Simplex-PI, which changes only one action with a maximal advantage.
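The distinction between the two variants can be illustrated in code. The following is a minimal sketch, not the paper's implementation: Howard's PI switches the action in every state with a positive advantage, while Simplex-PI switches only the single state-action pair with the largest advantage. The 2-state, 2-action MDP below is a made-up example for illustration.

```python
import numpy as np

n, m, gamma = 2, 2, 0.9  # states, actions, discount factor
# P[a][s, s'] = transition probability; r[s, a] = reward (hypothetical numbers)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])

def value(pi):
    """Solve (I - gamma * P_pi) v = r_pi for the value of policy pi."""
    P_pi = np.array([P[pi[s], s] for s in range(n)])
    r_pi = np.array([r[s, pi[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def advantages(pi):
    """Advantage adv[s, a] = Q(s, a) - V(s) with respect to policy pi."""
    v = value(pi)
    q = r + gamma * np.einsum('ast,t->sa', P, v)  # q[s, a]
    return q - v[:, None]

def policy_iteration(pi, howard=True):
    """Run PI until no state has a positive advantage; return (policy, iters)."""
    for it in range(1000):
        adv = advantages(pi)
        improvable = np.flatnonzero(adv.max(axis=1) > 1e-12)
        if improvable.size == 0:
            return pi, it  # converged: no positive advantage anywhere
        if howard:
            # Howard's PI: switch ALL states with a positive advantage
            for s in improvable:
                pi[s] = int(adv[s].argmax())
        else:
            # Simplex-PI: switch only the state with the maximal advantage
            s = int(adv.max(axis=1).argmax())
            pi[s] = int(adv[s].argmax())
    return pi, it
```

Both variants reach the same optimal value function on this toy MDP; they differ only in how many actions are switched per iteration, which is exactly the quantity whose worst-case behavior the bounds concern.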