Robust reinforcement learning aims to produce policies that retain strong
performance guarantees even when the parameters of the environment's transition
model are highly uncertain. Existing work uses value-based methods and the usual
primitive action setting. In this paper, we propose robust methods for learning
temporally abstract actions, in the framework of options. We present a Robust
Options Policy Iteration (ROPI) algorithm with convergence guarantees, which
learns options that are robust to model uncertainty. We utilize ROPI to learn
robust options with the Robust Options Deep Q Network (RO-DQN) that solves
multiple tasks and mitigates model misspecification due to model uncertainty.
We present experimental results which suggest that policy iteration with linear
features may have an inherent form of robustness when using coarse feature
representations. In addition, we present experimental results which demonstrate
that robustness helps policy iteration implemented on top of deep neural
networks to generalize over a much broader range of dynamics than non-robust
policy iteration.

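This work introduces ROPI, an algorithm for learning options that are robust to model uncertainty. We also use RO-DQN to solve multiple tasks and to mitigate model misspecification caused by model uncertainty. The experimental results suggest that policy iteration with linear features has an inherent form of robustness when coarse feature representations are used. The experiments further demonstrate that robustness helps policy iteration implemented on top of deep neural networks generalize over a much broader range of dynamics than non-robust policy iteration.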
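
To make the notion of robustness to model uncertainty concrete, the following is a minimal sketch of a worst-case Bellman backup over a finite set of candidate transition models. It is not ROPI or RO-DQN: it uses primitive actions and a tabular value function, and the function name `robust_value_iteration`, the array layout, and the assumption that the worst-case model is chosen independently for each state-action pair are illustrative choices that do not come from the paper.

```python
import numpy as np


def robust_value_iteration(models, rewards, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Worst-case value iteration over a finite uncertainty set (illustrative).

    models:  array of shape (K, A, S, S); models[k, a, s, s2] is the probability
             of moving from s to s2 under action a in candidate model k.
    rewards: array of shape (S, A) with the immediate reward for each (s, a).
    Returns the robust value function (S,) and a greedy robust policy (S,).
    """
    models = np.asarray(models, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    v = np.zeros(rewards.shape[0])
    for _ in range(max_iters):
        # Expected next-state value under every model and action: shape (K, A, S).
        next_v = models @ v
        # Assumption: the adversary picks the worst model per (s, a): shape (A, S).
        worst = next_v.min(axis=0)
        q = rewards + gamma * worst.T  # shape (S, A)
        new_v = q.max(axis=1)
        converged = np.max(np.abs(new_v - v)) < tol
        v = new_v
        if converged:
            break
    return v, q.argmax(axis=1)
```

Passing a single model (K = 1) reduces this to ordinary value iteration, so running the same function on a nominal model and on the full uncertainty set gives a direct comparison between the non-robust and robust solutions.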

Learning Robust Options

We investigate the use of temporally abstract actions, or macro-actions, in
the solution of Markov decision processes. Unlike current models that combine
both primitive actions and macro-actions and leave the state space unchanged,
we propose a hierarchical model (using an abstract MDP) that works with
macro-actions only, and that significantly reduces the size of the state space.
This is achieved by treating macro-actions as local policies that act in certain
regions of state space, and by restricting states in the abstract MDP to those
at the boundaries of regions. The abstract MDP approximates the original and
can be solved more efficiently. We discuss several ways in which macro-actions
can be generated to ensure good solution quality. Finally, we consider ways in
which macro-actions can be reused to solve multiple, related MDPs; and we show
that this can justify the computational overhead of macro-action generation.

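This paper proposes a hierarchical model, based on an abstract MDP, that works with macro-actions only and significantly reduces the size of the state space. It also discusses several ways to generate macro-actions and to reuse them to solve multiple related MDPs.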
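
As an illustration of how a macro-action, viewed as a local policy on one region, could be summarized for an abstract MDP over boundary states, the sketch below computes a discounted exit distribution and an expected discounted reward by solving a linear system. This is a standard way to model a policy that runs until it leaves a region, and the paper's exact construction may differ. The function name `macro_action_model`, its arguments, and the assumption that every transition from an interior state lands either on another interior state or on one of the listed exit states are hypothetical.

```python
import numpy as np


def macro_action_model(P_pi, r_pi, interior, exits, gamma=0.95):
    """Summarize a local policy as a macro-action model (illustrative).

    P_pi:     (S, S) transition matrix induced by the local policy.
    r_pi:     (S,) expected one-step reward under the local policy.
    interior: indices of states inside the region, where the policy runs.
    exits:    indices of boundary states, where the macro-action terminates.
    Returns (T, R): T[i, j] is the discounted probability of leaving through
    exits[j] when starting in interior[i]; R[i] is the expected discounted
    reward accumulated before leaving.
    """
    P_pi = np.asarray(P_pi, dtype=float)
    r_pi = np.asarray(r_pi, dtype=float)
    P_ii = P_pi[np.ix_(interior, interior)]  # moves that stay inside the region
    P_ie = P_pi[np.ix_(interior, exits)]     # moves that reach a boundary state
    # With gamma < 1 the matrix (I - gamma * P_ii) is always invertible.
    A = np.eye(len(interior)) - gamma * P_ii
    T = np.linalg.solve(A, gamma * P_ie)
    R = np.linalg.solve(A, r_pi[interior])
    return T, R
```

For a three-state chain in which the local policy moves deterministically from state 0 to 1 to 2, with states 0 and 1 as the interior, state 2 as the only exit, a reward of 1 per interior step, and `gamma=0.5`, the function returns discounted exit probabilities of 0.25 and 0.5 and rewards of 1.5 and 1.0 for the two interior states.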