The Robust Markov Decision Process (RMDP) framework focuses on designing
control policies that are robust to parameter uncertainties arising from
mismatches between the simulator model and real-world settings. An RMDP problem
is typically formulated as a max-min problem, where the objective is to find
the policy that maximizes the value function for the worst possible model that
lies in an uncertainty set around a nominal model. The standard robust dynamic
programming approach requires knowledge of the nominal model to compute
the optimal robust policy. In this work, we propose a model-based reinforcement
learning (RL) algorithm for learning an $\epsilon$-optimal robust policy when
the nominal model is unknown. We consider three different forms of uncertainty
sets, characterized by the total variation distance, chi-square divergence, and
KL divergence. For each of these uncertainty sets, we give a precise
characterization of the sample complexity of our proposed algorithm. In
addition to the sample complexity results, we also present a formal analytical
argument for the benefit of using robust policies. Finally, we demonstrate the
performance of our algorithm on two benchmark problems.
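
The max-min formulation described above can be illustrated with a minimal sketch of robust value iteration over a total variation uncertainty set. This is not the paper's algorithm: it assumes a small tabular MDP whose nominal model `P0` is already available (the paper instead addresses the case where it is unknown), an uncertainty set defined independently for each state-action pair, and total variation distance taken as half the L1 distance with a radius `rho`. All function and variable names are hypothetical.

```python
import numpy as np

def worst_case_tv(p0, v, rho):
    """Minimize p @ v over distributions p with 0.5 * ||p - p0||_1 <= rho.

    Greedy solution: shift up to `rho` probability mass from the
    highest-value next states onto the lowest-value next state.
    """
    p = np.array(p0, dtype=float)
    lowest = int(np.argmin(v))
    budget = rho
    for s in np.argsort(v)[::-1]:  # visit next states from highest value down
        if budget <= 0:
            break
        if s == lowest:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[lowest] += moved
        budget -= moved
    return p

def robust_value_iteration(P0, R, gamma, rho, iters=500):
    """Assumed shapes: P0 is (S, A, S) nominal transitions, R is (S, A) rewards."""
    S, A, _ = P0.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                # Inner min: worst model in the TV ball around the nominal row.
                p = worst_case_tv(P0[s, a], v, rho)
                q[s, a] = R[s, a] + gamma * p @ v
        v = q.max(axis=1)  # outer max over actions
    return v, q.argmax(axis=1)
```

Setting `rho = 0` recovers standard value iteration on the nominal model, so comparing the two outputs shows how much value the worst-case model removes.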

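This study proposes a model-based reinforcement learning algorithm for learning a robust control policy that is optimal under both the nominal model and uncertain models, and considers several forms of uncertainty sets.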

Sample Complexity of Robust Reinforcement Learning with a Generative Model

This paper proposes adversarial attacks for Reinforcement Learning (RL) and
then uses these attacks to improve the robustness of Deep Reinforcement Learning
(DRL) algorithms to parameter uncertainties. We show that even a naively
engineered attack successfully degrades the performance of a DRL algorithm. We
further improve the attack using gradient information of an engineered loss
function, which degrades performance further. These attacks are then leveraged
during training to improve the robustness of RL within a robust control
framework. We show that this adversarial training of DRL algorithms such as
Deep Double Q-learning and Deep Deterministic Policy Gradients leads to a
significant increase in robustness to parameter variations on RL benchmarks
such as the Cart-pole, Mountain Car, Hopper and Half Cheetah environments.
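
The abstract does not specify how the naive attack is constructed, so the following is only one plausible sketch of a sampling-based attack on an agent's observation. It assumes a hypothetical linear Q-function `Q(obs) = W @ obs`, uniform perturbations bounded by `eps`, and a selection rule that keeps the perturbation whose induced greedy action is worst at the true observation. None of these choices are taken from the paper.

```python
import numpy as np

def greedy_action(W, obs):
    """Greedy action of a hypothetical linear Q-function Q(obs) = W @ obs."""
    return int(np.argmax(W @ obs))

def naive_attack(W, obs, eps, n_samples, rng):
    """Sample bounded perturbations of the observation and keep the one whose
    induced greedy action has the lowest Q-value at the true observation."""
    q_true = W @ obs
    best_obs, best_q = obs, q_true[greedy_action(W, obs)]
    for _ in range(n_samples):
        candidate = obs + rng.uniform(-eps, eps, size=obs.shape)
        q = q_true[greedy_action(W, candidate)]
        if q < best_q:
            best_obs, best_q = candidate, q
    return best_obs
```

Under the same assumptions, adversarial training would feed `naive_attack(W, obs, eps, n_samples, rng)` to the agent in place of `obs` during rollouts, so the learned policy sees perturbed observations.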

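This paper proposes adversarial attacks on reinforcement learning and uses these attacks to improve the robustness of deep reinforcement learning algorithms to parameter uncertainties. We show that even a simple attack successfully degrades the performance of deep reinforcement learning algorithms, and that using gradient information of an engineered loss function improves the attack and degrades performance further. These attacks are used during training to improve the robustness of RL within the robust control framework. We show that adversarial training of DRL algorithms significantly improves their robustness to parameter variations in RL benchmark environments such as Cart-pole, Mountain Car, Hopper and Half Cheetah.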