In robust Markov decision processes (MDPs), uncertainty in the transition kernel is addressed by finding a policy that optimizes the worst-case performance over an uncertainty set of MDPs. While much of the literature has focused on discounted MDPs, robust average-reward MDPs remain largely unexplored. In this paper, we focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set. We first take an approach that approximates average-reward MDPs using discounted MDPs. We prove that the robust discounted value function converges to the robust average reward as the discount factor $\gamma$ goes to $1$, and moreover, that when $\gamma$ is sufficiently large, any optimal policy of the robust discounted MDP is also an optimal policy of the robust average-reward MDP. We further design a robust dynamic programming approach and theoretically characterize its convergence to the optimum. Then, we investigate robust average-reward MDPs directly, without using discounted MDPs as an intermediate step. We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value iteration algorithm that provably finds its solution, or equivalently, the optimal robust policy.
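
The two central statements of the abstract can be written out as follows. This is only a sketch in one common notation; every symbol is an assumption introduced here rather than taken from the abstract: $V^{\pi}_{\mathcal{P},\gamma}$ is the unnormalized robust discounted value of a policy $\pi$ over an uncertainty set $\mathcal{P}$, $g^{\pi}_{\mathcal{P}}$ is its worst-case average reward, $\mathcal{P}^a_s$ is the set of admissible next-state distributions at the state-action pair $(s,a)$, and $\sigma$ is the worst-case expectation over that set. The exact conditions under which these hold are stated in the paper and are not reproduced here.

$$
\lim_{\gamma \to 1} (1-\gamma)\, V^{\pi}_{\mathcal{P},\gamma}(s) = g^{\pi}_{\mathcal{P}}(s)
$$

$$
V(s) = \max_{a} \left\{ r(s,a) - g + \sigma_{\mathcal{P}^a_s}(V) \right\},
\qquad
\sigma_{\mathcal{P}^a_s}(V) = \min_{p \in \mathcal{P}^a_s} p^{\top} V
$$

In the second display, the scalar $g$ plays the role of the optimal robust average reward, and a policy that attains the maximum at every state is the candidate optimal robust policy.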

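This paper studies robust average-reward MDPs, aiming to find a policy that optimizes the worst-case average reward over an uncertainty set of MDPs. The authors first approach the problem through discounted MDPs: they prove that the robust discounted value function converges to the robust average reward as the discount factor approaches 1, and they design a robust dynamic programming method. They also treat robust average-reward MDPs directly, deriving the corresponding robust Bellman equation and designing a robust relative value iteration algorithm to solve for the optimal policy.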
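
To make the idea of robust relative value iteration concrete, here is a minimal sketch in Python. It is not the paper's algorithm as published. It assumes a $\delta$-contamination uncertainty set around a nominal kernel, chosen here only because its worst-case expectation has a closed form, and the names `worst_case_value`, `robust_q`, and `robust_rvi` are hypothetical. Each sweep applies a robust Bellman backup and then subtracts the value at a reference state so that the iterates stay bounded.

```python
def worst_case_value(p_nominal, w, delta):
    """Worst-case expectation of w over the delta-contamination set
    {(1 - delta) * p_nominal + delta * q : q any distribution} (an assumed model)."""
    nominal = sum(p * x for p, x in zip(p_nominal, w))
    # The adversary places its delta mass on the state with the smallest value.
    return (1.0 - delta) * nominal + delta * min(w)


def robust_q(P, r, w, delta):
    """Robust backup: q[s][a] = r[s][a] + worst-case expected next value."""
    return [
        [r[s][a] + worst_case_value(P[s][a], w, delta) for a in range(len(r[s]))]
        for s in range(len(r))
    ]


def robust_rvi(P, r, delta, ref_state=0, tol=1e-10, max_iter=10_000):
    """Sketch of robust relative value iteration.

    P[s][a] is the nominal next-state distribution, r[s][a] the reward.
    Returns (estimated robust average reward, relative values, greedy policy).
    """
    w = [0.0] * len(r)
    for _ in range(max_iter):
        v = [max(row) for row in robust_q(P, r, w, delta)]
        # Subtract the reference-state value so the iterates do not drift.
        w_next = [x - v[ref_state] for x in v]
        converged = max(abs(a - b) for a, b in zip(w_next, w)) < tol
        w = w_next
        if converged:
            break
    q = robust_q(P, r, w, delta)
    gain = max(q[ref_state])  # valid because w[ref_state] == 0
    policy = [row.index(max(row)) for row in q]
    return gain, w, policy


# Toy two-state example (hypothetical data): state 1 pays reward 1, state 0 pays 0.
P = [
    [[0.9, 0.1], [0.1, 0.9]],  # state 0: action 0 mostly stays, action 1 mostly moves to 1
    [[0.1, 0.9], [0.1, 0.9]],  # state 1: both actions mostly stay in state 1
]
r = [[0.0, 0.0], [1.0, 1.0]]

gain, w, policy = robust_rvi(P, r, delta=0.2)
print(gain, w, policy)
```

With `delta=0` the sketch reduces to ordinary relative value iteration on the nominal kernel. Other uncertainty sets would need a different `worst_case_value`, typically a small optimization problem per state-action pair.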

Robust Average-Reward Markov Decision Processes