We adopt a policy optimization viewpoint towards policy evaluation for robust Markov decision processes (MDPs) with $\mathrm{s}$-rectangular ambiguity sets. The developed method, named first-order policy evaluation (FRPE), provides the first unified framework for robust policy evaluation in both deterministic (offline) and stochastic (online) settings, with either tabular representation or generic function approximation. In particular, we establish linear convergence in the deterministic setting, and $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity in the stochastic setting. FRPE also extends naturally to evaluating the robust state-action value function with $(\mathrm{s}, \mathrm{a})$-rectangular ambiguity sets. We discuss the application of the developed results to stochastic policy optimization of large-scale robust MDPs.

First-order policy optimization methods from reinforcement learning for robust policy evaluation