TL;DR: This paper studies the number of iterations that two policy iteration algorithms need in order to converge on Markov decision processes, and uses structural properties to derive upper bounds that are independent of the discount factor; under the assumption that the state space splits into transient and recurrent states, both Howard's PI and Simplex-PI terminate in strongly polynomial time.
Abstract
Given a Markov decision process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge. We consider two variations of PI: Howard's PI, which changes all the actions with a positive advantage, and Simplex-PI, which changes only one action with a maximal advantage.
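The distinction between the two variants can be illustrated in code. The following is a minimal sketch, not the paper's implementation: Howard's PI switches the action in every state with a positive advantage, while Simplex-PI switches only the single state-action pair with the largest advantage. The 2-state, 2-action MDP below is a made-up example for illustration.

```python
import numpy as np

n, m, gamma = 2, 2, 0.9  # states, actions, discount factor
# P[a][s, s'] = transition probability; r[s, a] = reward (hypothetical numbers)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])

def value(pi):
    """Solve (I - gamma * P_pi) v = r_pi for the value of policy pi."""
    P_pi = np.array([P[pi[s], s] for s in range(n)])
    r_pi = np.array([r[s, pi[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def advantages(pi):
    """Advantage adv[s, a] = Q(s, a) - V(s) with respect to policy pi."""
    v = value(pi)
    q = r + gamma * np.einsum('ast,t->sa', P, v)  # q[s, a]
    return q - v[:, None]

def policy_iteration(pi, howard=True):
    """Run PI until no state has a positive advantage; return (policy, iters)."""
    for it in range(1000):
        adv = advantages(pi)
        improvable = np.flatnonzero(adv.max(axis=1) > 1e-12)
        if improvable.size == 0:
            return pi, it  # converged: no positive advantage anywhere
        if howard:
            # Howard's PI: switch ALL states with a positive advantage
            for s in improvable:
                pi[s] = int(adv[s].argmax())
        else:
            # Simplex-PI: switch only the state with the maximal advantage
            s = int(adv.max(axis=1).argmax())
            pi[s] = int(adv[s].argmax())
    return pi, it
```

Both variants reach the same optimal value function on this toy MDP; they differ only in how many actions are switched per iteration, which is exactly the quantity whose worst-case behavior the bounds concern.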