We consider the problem of finding the best memoryless stochastic policy for an infinite-horizon partially observable Markov decision process (POMDP) with finite state and action spaces with respect to either the discounted or mean reward criterion. We show that the (discounted) state-action frequencies and the expected cumulative reward are rational functions of the policy, whereby the degree is determined by the degree of partial observability. We then describe the optimization problem as a linear optimization problem in the space of feasible state-action frequencies subject to polynomial constraints that we characterize explicitly. This allows us to address the combinatorial and geometric complexity of the optimization problem using recent tools from polynomial optimization. In particular, we demonstrate how the partial observability constraints can lead to multiple smooth and non-smooth local optimizers and we estimate the number of critical points.

本研究考虑了有限状态和动作空间的无穷时部分观察到的马尔可夫决策问题中，根据折扣或平均收益准则找到最佳的无记忆随机策略并描述了优化问题作为可行状态-动作频率空间中的线性优化问题并使用了多项式优化的最大化奖励来解决导航问题。

无记忆随机策略优化在无限时域POMDP中的几何