Markov decision processes (MDPs) provide a standard framework for sequential
decision making under uncertainty. However, transition probabilities in MDPs
are often estimated from data and MDPs do not take data uncertainty into
account. Robust Markov decision processes (RMDPs) address this shortcoming of
MDPs by assigning to each transition an uncertainty set rather than a single
probability value. The goal of solving RMDPs is then to find a policy which
maximizes the worst-case performance over the uncertainty sets. In this work,
we consider polytopic RMDPs in which all uncertainty sets are polytopes and
study the problem of solving long-run average reward polytopic RMDPs. Our focus
is on computational complexity aspects and efficient algorithms. We present a
novel perspective on this problem and show that it can be reduced to solving
long-run average reward turn-based stochastic games with finite state and
action spaces. This reduction allows us to derive several important
consequences that were hitherto not known to hold for polytopic RMDPs. First,
we derive new computational complexity bounds for solving long-run average
reward polytopic RMDPs, showing for the first time that the threshold decision
problem for them is in NP coNP and that they admit a randomized algorithm with
sub-exponential expected runtime. Second, we present Robust Polytopic Policy
Iteration (RPPI), a novel policy iteration algorithm for solving long-run
average reward polytopic RMDPs. Our experimental evaluation shows that RPPI is
much more efficient in solving long-run average reward polytopic RMDPs compared
to state-of-the-art methods based on value iteration.

马尔科夫决策过程（MDPs）为不确定性下的顺序决策制定了标准框架，但是 MDPs 中的转移概率通常是从数据中估计的，并且 MDPs 不考虑数据的不确定性。鲁棒马尔科夫决策过程（RMDPs）通过为每个转移分配不确定性集合而不是单个概率值来解决了 MDPs 的这个缺点。解决 RMDPs 的目标是找到一种策略，使得在不确定性集合上最大化最坏情况的性能。本文考虑多面体 RMDPs，在其中所有的不确定性集合都是多面体，并研究解决长期平均回报的多面体 RMDPs 的问题。我们关注计算复杂性方面和高效算法。我们提出了这个问题的一个新视角，并且证明它可以简化为解决具有有限状态和动作空间的长期平均回报的轮流随机游戏。这个简化使我们能够得出几个重要的结论，这些结论以前是未知的。首先，我们为解决长期平均回报的多面体 RMDPs 推导出新的计算复杂性界限，首次证明它们的阈值决策问题属于 NP coNP，并且它们具有具有亚指数期望运行时间的随机算法。其次，我们提出了鲁棒多面体策略迭代（RPPI），一种用于解决长期平均回报的多面体 RMDPs 的新型策略迭代算法。我们的实验评估表明，相比基于值迭代的现有方法，RPPI 在解决长期平均回报的多面体 RMDPs 方面更加高效。