Recently discovered polyhedral structures of the value function for finite
state-action discounted Markov decision processes (MDPs) shed light on
the success of reinforcement learning. We investigate the value
function polytope in greater detail and characterize the polytope boundary
using a hyperplane arrangement. We further show that the value space is a union
of finitely many cells of the same hyperplane arrangement and relate it to the
polytope of the classical linear programming formulation for MDPs. Inspired by
these geometric properties, we propose a new algorithm, Geometric Policy
Iteration (GPI), to solve discounted MDPs. GPI updates the policy of a single
state by switching to an action that is mapped to the boundary of the value
function polytope, followed by an immediate update of the value function. This
new update rule aims at a faster value improvement without compromising
computational efficiency. Moreover, our algorithm allows asynchronous updates
of state values, which makes it more flexible than traditional policy
iteration and advantageous when the state set is large. We prove that the complexity of
GPI achieves the best known bound $\mathcal{O}\left(\frac{|\mathcal{A}|}{1 -
\gamma}\log \frac{1}{1-\gamma}\right)$ of policy iteration and empirically
demonstrate the strength of GPI on MDPs of various sizes.
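
To make the update pattern described above concrete, the following is a minimal Python sketch of a policy iteration that changes the action of one state at a time and re-evaluates the value function immediately after each switch. It is not the GPI action-selection rule: the abstract does not specify how the boundary action is chosen, so this sketch substitutes the standard greedy one-step lookahead. The array layout (`P[a, s, s']` for transitions, `R[s, a]` for rewards), the function names, and the toy two-state MDP are all assumptions made for illustration.

```python
import numpy as np


def evaluate(P, R, policy, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    n_states = P.shape[1]
    idx = np.arange(n_states)
    P_pi = P[policy, idx]      # row s is P[policy[s], s, :]
    r_pi = R[idx, policy]      # entry s is R[s, policy[s]]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


def single_state_policy_iteration(P, R, gamma, max_sweeps=1000, tol=1e-10):
    """Switch one state's action at a time, re-evaluating after every switch.

    Assumption: the action is picked by greedy one-step lookahead, which
    stands in for the boundary-action rule mentioned in the abstract.
    """
    n_states = P.shape[1]
    policy = np.zeros(n_states, dtype=int)
    v = evaluate(P, R, policy, gamma)
    for _ in range(max_sweeps):
        changed = False
        for s in range(n_states):
            # One-step lookahead values of every action in state s.
            q = R[s] + gamma * P[:, s, :] @ v
            best = int(np.argmax(q))
            if q[best] > q[policy[s]] + tol:
                policy[s] = best
                # Immediate value update, so later states see the new values.
                v = evaluate(P, R, policy, gamma)
                changed = True
        if not changed:
            break
    return policy, v


# Hypothetical toy MDP: action 0 stays in place, action 1 moves to the other state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
# Reward 1 only for staying in state 1.
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
policy, v = single_state_policy_iteration(P, R, gamma=0.9)
```

The exact re-evaluation after every switch is the simplest choice for a sketch; it costs a linear solve per switch, so it illustrates the order of updates rather than the computational efficiency the abstract claims for GPI.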

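Summary: This work studies the polyhedral structure of the value function for finite state-action discounted Markov decision processes and characterizes the polytope boundary using a hyperplane arrangement. It proposes a new algorithm, Geometric Policy Iteration (GPI), to solve discounted MDPs. GPI updates the policy of a single state at a time, aiming at faster value improvement without compromising computational efficiency, and it allows asynchronous updates of state values. The authors prove that the complexity of GPI achieves the best known bound for policy iteration and demonstrate the strength of GPI on MDPs of various sizes.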