We consider the problem of solving a robust Markov decision process (MDP),
which involves a set of discounted, finite state, finite action space MDPs with
uncertain transition kernels. The goal of planning is to find a robust policy
that optimizes the worst-case values against the transition uncertainties, and
thus encompasses the standard MDP planning as a special case. For
$(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we develop a
policy-based first-order method, namely the robust policy mirror descent
(RPMD), and establish $\mathcal{O}(\log(1/\epsilon))$ and
$\mathcal{O}(1/\epsilon)$ iteration complexities for finding an
$\epsilon$-optimal policy, with two increasing-stepsize schemes. These
convergence results hold for any Bregman divergence, provided the
policy space has bounded radius measured by the divergence when centered at
the initial policy. Moreover, when the Bregman divergence corresponds to the
squared Euclidean distance, we establish an $\mathcal{O}(\max \{1/\epsilon,
1/(\eta \epsilon^2)\})$ complexity of RPMD with any constant stepsize $\eta$.
For a general class of Bregman divergences, a similar complexity is also
established for RPMD with constant stepsizes, provided the uncertainty set
satisfies relative strong convexity. We further develop a stochastic
variant, named SRPMD, for the case where first-order information is only
available through online interactions with the nominal environment. For general
Bregman divergences, we establish $\mathcal{O}(1/\epsilon^2)$ and
$\mathcal{O}(1/\epsilon^3)$ sample complexities with two increasing-stepsize
schemes. For the Euclidean Bregman divergence, we establish an
$\mathcal{O}(1/\epsilon^3)$ sample complexity with constant stepsizes. To the
best of our knowledge, all the aforementioned results appear to be new for
policy-based first-order methods applied to the robust MDP problem.
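As a rough illustration of the kind of update the abstract describes, the sketch below runs a policy mirror descent step with the KL divergence as the Bregman divergence on a tiny robust MDP. Everything concrete here is an assumption made for illustration: the two-state problem, the cost-minimization convention, the finite candidate kernels in each $(\mathbf{s},\mathbf{a})$-rectangular set, and the geometric stepsize schedule `eta0 * GAMMA ** (-k)`. It is not the paper's exact algorithm or schedule.

```python
import math

GAMMA = 0.9
N_S, N_A = 2, 2
# Hypothetical toy problem (not from the paper): per-step costs COST[s][a], minimized.
COST = [[1.0, 2.0], [0.5, 0.0]]
# (s,a)-rectangular uncertainty: KERNELS[s][a] is a finite list of candidate
# next-state distributions, chosen independently for each state-action pair.
KERNELS = [
    [[[0.9, 0.1], [0.6, 0.4]], [[0.2, 0.8], [0.1, 0.9]]],
    [[[0.5, 0.5], [0.3, 0.7]], [[0.1, 0.9], [0.4, 0.6]]],
]


def worst_case_q(v):
    """Robust Q-values: cost plus discounted worst-case expected next value."""
    return [
        [
            COST[s][a]
            + GAMMA * max(sum(p * x for p, x in zip(dist, v)) for dist in KERNELS[s][a])
            for a in range(N_A)
        ]
        for s in range(N_S)
    ]


def robust_evaluate(policy, sweeps=500):
    """Worst-case value of a fixed policy via repeated robust Bellman backups."""
    v = [0.0] * N_S
    for _ in range(sweeps):
        q = worst_case_q(v)
        v = [sum(policy[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
    return v


def robust_value_iteration(sweeps=500):
    """Reference solution: optimal worst-case value for comparison."""
    v = [0.0] * N_S
    for _ in range(sweeps):
        v = [min(row) for row in worst_case_q(v)]
    return v


def softmax(logits):
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def rpmd_kl(iterations=100, eta0=1.0):
    """Mirror descent with KL divergence: pi_{k+1} is proportional to pi_k * exp(-eta_k * Q)."""
    logits = [[0.0] * N_A for _ in range(N_S)]  # uniform initial policy
    for k in range(iterations):
        policy = [softmax(row) for row in logits]
        q = worst_case_q(robust_evaluate(policy))
        eta = eta0 * GAMMA ** (-k)  # assumed increasing-stepsize schedule
        for s in range(N_S):
            updated = [logits[s][a] - eta * q[s][a] for a in range(N_A)]
            top = max(updated)
            logits[s] = [l - top for l in updated]  # shift only; policy unchanged
    return [softmax(row) for row in logits]


policy = rpmd_kl()
v_pi = robust_evaluate(policy)
v_star = robust_value_iteration()
print(policy, v_pi, v_star)
```

Working in logit space keeps the multiplicative update numerically stable as the stepsize grows, and comparing `v_pi` with `v_star` gives a quick check that the returned policy is close to robust-optimal on this toy instance.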

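This paper studies robust planning for discounted, finite-state, finite-action MDPs with uncertain transition kernels, aiming to find an optimal policy that is robust to transition uncertainty. The problem includes standard MDP planning as a special case. The paper proposes a policy-based first-order method named RPMD and, for two increasing-stepsize schemes, establishes O(log(1/ε)) and O(1/ε) iteration complexities for finding an ε-optimal policy. It also proposes a stochastic variant named SRPMD.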

First-order Policy Optimization for Robust Markov Decision Process

In high-stakes scenarios like medical treatment and auto-piloting, it is risky
or even infeasible to collect online experimental data to train the agent.
Simulation-based training can alleviate this issue, but may suffer from
inherent mismatches between the simulator and the real environment. It is therefore
imperative to utilize the simulator to learn a robust policy for the real-world
deployment. In this work, we consider policy learning for Robust Markov
Decision Processes (RMDP), where the agent tries to seek a robust policy with
respect to unexpected perturbations of the environment. Specifically, we focus
on the setting where the training environment can be characterized as a
generative model and a constrained perturbation can be added to the model
during testing. Our goal is to identify a near-optimal robust policy for the
perturbed testing environment, which introduces additional technical
difficulties as we need to simultaneously estimate the training environment
uncertainty from samples and find the worst-case perturbation for testing. To
solve this issue, we propose a generic method which formalizes the perturbation
as an opponent to obtain a two-player zero-sum game, and further show that the
Nash Equilibrium corresponds to the robust policy. We prove that, with a
polynomial number of samples from the generative model, our algorithm can find
a near-optimal robust policy with high probability. Our method can
handle general perturbations under some mild assumptions and can also be
extended to more complex problems such as robust partially observable Markov
decision processes, thanks to the game-theoretic formulation.
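To make the link between the robust policy and the Nash equilibrium concrete, here is a minimal sketch on a finite zero-sum game in which the agent picks a policy and the opponent picks a perturbation. The payoff table `PAYOFF` and all function names are hypothetical, and the example only covers pure strategies with a saddle point. The setting in the abstract (a Markov game estimated from generative-model samples) is considerably richer.

```python
# Hypothetical returns: PAYOFF[i][j] is the agent's return when it plays
# policy i and the opponent applies perturbation j (opponent wants it small).
PAYOFF = [
    [3, 1, 4],
    [2, 2, 3],
    [5, 0, 1],
]


def maximin_policy(payoff):
    """Robust choice for the agent: best worst-case return over perturbations."""
    worst = [min(row) for row in payoff]
    best = max(range(len(payoff)), key=worst.__getitem__)
    return best, worst[best]


def minimax_perturbation(payoff):
    """Opponent's choice: perturbation minimizing the agent's best response."""
    n_cols = len(payoff[0])
    best_response = [max(row[j] for row in payoff) for j in range(n_cols)]
    j = min(range(n_cols), key=best_response.__getitem__)
    return j, best_response[j]


def is_equilibrium(payoff, i, j):
    """True if neither player gains by deviating unilaterally from (i, j)."""
    agent_ok = all(payoff[i][j] >= payoff[k][j] for k in range(len(payoff)))
    opponent_ok = all(payoff[i][j] <= payoff[i][l] for l in range(len(payoff[0])))
    return agent_ok and opponent_ok


robust_policy, lower_value = maximin_policy(PAYOFF)
perturbation, upper_value = minimax_perturbation(PAYOFF)
print(robust_policy, perturbation, lower_value, upper_value,
      is_equilibrium(PAYOFF, robust_policy, perturbation))
```

When the lower and upper values coincide, as they do for this table, the maximin policy paired with the minimax perturbation is an equilibrium, which is the sense in which the equilibrium strategy of the agent is also its robust policy.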

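Training an agent in a simulator to learn a robust policy addresses the problem that real-world experiments are infeasible in high-stakes settings such as medical treatment and autonomous driving. This study represents the training environment as a generative model and proposes a game-theoretic algorithm that handles both test-time perturbations and environment uncertainty, obtaining a near-optimal robust policy.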

Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

Bayesian optimisation has been successfully applied to a variety of
reinforcement learning problems. However, the traditional approach for learning
optimal policies in simulators does not utilise the opportunity to improve
learning by adjusting certain environment variables: state features that are
unobservable and randomly determined by the environment in a physical setting
but are controllable in a simulator. This paper considers the problem of
finding a robust policy while taking into account the impact of environment
variables. We present Alternating Optimisation and Quadrature (ALOQ), which
uses Bayesian optimisation and Bayesian quadrature to address such settings.
ALOQ is robust to the presence of significant rare events, which may not be
observable under random sampling, but play a substantial role in determining
the optimal policy. Experimental results across different domains show that
ALOQ can learn more efficiently and robustly than existing methods.
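The effect of significant rare events can be illustrated with a small sketch. It is not ALOQ itself: a plain probability-weighted sum over a discrete environment variable stands in for Bayesian quadrature, and an exhaustive comparison of two policies stands in for Bayesian optimisation. The policies, returns, probabilities, and the observed sample are all hypothetical.

```python
# Hypothetical environment variable: random in the physical setting,
# but controllable in the simulator, with known probabilities.
THETA_PROB = {"normal": 0.99, "rare_failure": 0.01}

# Hypothetical simulator returns RETURNS[policy][theta].
RETURNS = {
    "risky": {"normal": 10.0, "rare_failure": -2000.0},
    "safe": {"normal": 8.0, "rare_failure": 5.0},
}


def marginal_return(policy):
    """Expected return with the environment variable integrated out exactly."""
    return sum(prob * RETURNS[policy][theta] for theta, prob in THETA_PROB.items())


def sample_average(policy, sampled_thetas):
    """Naive estimate from whatever environment values happened to be drawn."""
    return sum(RETURNS[policy][theta] for theta in sampled_thetas) / len(sampled_thetas)


# A plausible small random sample in which the rare event never occurs.
observed = ["normal"] * 20

best_by_sampling = max(RETURNS, key=lambda pi: sample_average(pi, observed))
best_by_marginal = max(RETURNS, key=marginal_return)
print(best_by_sampling, best_by_marginal)
```

Because the simulator lets the learner set the environment variable directly, the rare failure can be evaluated on purpose and weighted by its probability, which changes the preferred policy compared with the estimate from random sampling alone.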

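This paper proposes a method named ALOQ, which combines Bayesian optimisation and Bayesian quadrature to find robust policies while accounting for the impact of environment variables. Experiments show that ALOQ is more efficient and robust than existing methods.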