Recently discovered polyhedral structures of the value function for finite state-action discounted Markov decision processes (MDP) shed light on understanding the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement and relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, \emph{Geometric Policy Iteration} (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims at a faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values which is more flexible and advantageous compared to traditional policy iteration when the state set is large. We prove that the complexity of GPI achieves the best known bound $\bigO{\frac{|\actions|}{1 - \gamma}\log \frac{1}{1-\gamma}}$ of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.

探究了有限状态-动作折扣马尔可夫决策过程的价值函数多面体结构，并使用超平面排列表征了多面体的边界。提出了一种新的算法Geometric Policy Iteration (GPI)来解决折扣MDPs，它使用单个状态的策略更新，以更快的价值改进不影响计算效率，同时允许状态值的异步更新。证明了GPI的复杂度达到了策略迭代的最佳已知界限，并展示了GPI在各种大小的MDPs上的优越性。

马尔可夫决策过程的几何策略迭代