We study policy optimization methods for reinforcement learning in settings where the state space is arbitrarily large, or even countably infinite. The motivation arises from control problems in communication networks, matching markets, and other queueing systems. We focus on Natural Policy Gradient (NPG), a popular algorithm for finite state spaces. Under reasonable assumptions, we derive a performance bound for NPG that is independent of the size of the state space, provided the error in policy evaluation is within a factor of the true value function. We obtain this result by establishing new policy-independent bounds on the solution to Poisson's equation, i.e., the relative value function, and by combining these bounds with previously known connections between MDPs and learning from experts.

Performance of NPG in Countable State-Space Average-Cost Reinforcement Learning