Reinforcement learning (RL) agents are vulnerable to adversarial
disturbances, which can deteriorate task performance or compromise safety
specifications. Existing methods either address safety requirements under the
assumption of no adversary (e.g., safe RL) or only focus on robustness against
performance adversaries (e.g., robust RL). Learning one policy that is both
safe and robust remains a challenging open problem. The difficulty is how to
tackle two intertwined aspects in the worst cases: feasibility and optimality.
Optimality is only valid inside a feasible region, while identification of the
maximal feasible region must rely on learning the optimal policy. To address
this issue, we propose a systematic framework to unify safe RL and robust RL,
including problem formulation, iteration scheme, convergence analysis and
practical algorithm design. This unification is built upon constrained
two-player zero-sum Markov games. A dual policy iteration scheme is proposed,
which simultaneously optimizes a task policy and a safety policy. The
convergence of this iteration scheme is proved. Furthermore, we design a deep
RL algorithm for practical implementation, called dually robust actor-critic
(DRAC). The evaluations with safety-critical benchmarks demonstrate that DRAC
achieves high performance and persistent safety under all scenarios (no
adversary, safety adversary, performance adversary), outperforming all
baselines significantly.
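
The feasibility side of this coupling can be illustrated on a toy problem. The sketch below is not DRAC and is not taken from the paper: it assumes a hypothetical six-state chain with deterministic transitions, an unsafe state 0, and an adversary that can only push the agent in states 1 and 2. It computes the largest set of states from which the agent can stay safe forever, first with no adversary and then against a worst-case adversary, by repeatedly discarding states that have no action safe against every adversary move.

```python
# Toy constrained zero-sum game on a chain (all names and numbers are hypothetical).
N_STATES = 6
UNSAFE = frozenset({0})
AGENT_ACTIONS = (-1, +1)


def adversary_actions(state, adversarial):
    """Adversary moves available in a state; it can only push in states 1 and 2."""
    if adversarial and state in (1, 2):
        return (0, -2)
    return (0,)


def step(state, agent_action, adversary_action):
    """Deterministic transition, clipped to the ends of the chain."""
    return min(max(state + agent_action + adversary_action, 0), N_STATES - 1)


def maximal_feasible_region(adversarial):
    """Largest set of states from which safety can be kept forever.

    A state is kept if some agent action leads back into the current
    candidate set for every adversary action; iterate to a fixed point.
    """
    region = set(range(N_STATES)) - UNSAFE
    while True:
        kept = {
            s
            for s in region
            if any(
                all(step(s, a, b) in region for b in adversary_actions(s, adversarial))
                for a in AGENT_ACTIONS
            )
        }
        if kept == region:
            return frozenset(region)
        region = kept


nominal_region = maximal_feasible_region(adversarial=False)
robust_region = maximal_feasible_region(adversarial=True)
```

In this toy setting the robust region is a strict subset of the nominal one, which is why a policy that is safe without an adversary may still be infeasible once a safety adversary is present.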

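This paper proposes a systematic framework that unifies safe RL and robust RL, covering problem formulation, iteration scheme, convergence analysis, and practical algorithm design. The framework is built on constrained two-player zero-sum Markov games and introduces a dual policy iteration scheme that simultaneously optimizes a task policy and a safety policy. The convergence of this iteration scheme is proved. In addition, a deep RL algorithm for practical implementation, called DRAC, is designed. Evaluations on safety-critical benchmarks show that DRAC achieves high performance and persistent safety in all scenarios (no adversary, safety adversary, performance adversary) and clearly outperforms all baselines.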

Safe Reinforcement Learning with Dual Robustness

Existing on-policy imitation learning algorithms, such as DAgger, assume
access to a fixed supervisor. However, there are many settings where the
supervisor may evolve during policy learning, such as a human performing a
novel task or an improving algorithmic controller. We formalize imitation
learning from a "converging supervisor" and provide sublinear static and
dynamic regret guarantees against the best policy in hindsight with labels from
the converged supervisor, even when labels during learning are only from
intermediate supervisors. We then show that this framework is closely connected
to a class of reinforcement learning (RL) algorithms known as dual policy
iteration (DPI), which alternate between training a reactive learner with
imitation learning and a model-based supervisor with data from the learner.
Experiments suggest that when this framework is applied with the
state-of-the-art deep model-based RL algorithm PETS as an improving supervisor,
it outperforms deep RL baselines on continuous control tasks and provides up to
an 80-fold speedup in policy evaluation.
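
As a rough illustration of learning from a converging supervisor, the sketch below runs on-policy imitation with online gradient steps on a scalar toy system. Everything in it is an assumption made for illustration and is not the paper's experimental setup: the dynamics x' = x + u, the linear learner u = -theta * x, the supervisor gain schedule, and the constants. Each round, states are collected by rolling out the learner itself, labelled by the current intermediate supervisor, and used for one gradient step on the squared imitation loss.

```python
# Hypothetical scalar example: the learner imitates a supervisor whose gain converges.
K_STAR = 0.5      # gain of the converged supervisor (assumed value)
HORIZON = 5       # rollout length per round
STEP_SIZE = 0.5   # gradient step size


def supervisor_gain(round_idx):
    """Gain of the intermediate supervisor at a given round; tends to K_STAR."""
    return K_STAR + 0.5 / (round_idx + 1)


def rollout_states(theta, x0=1.0):
    """States visited when the learner controls x' = x + u with u = -theta * x."""
    states, x = [], x0
    for _ in range(HORIZON):
        states.append(x)
        x = x + (-theta * x)
    return states


def train(num_rounds=200, theta=0.1):
    """On-policy imitation: labels always come from the current supervisor."""
    for t in range(num_rounds):
        k_t = supervisor_gain(t)
        states = rollout_states(theta)
        # Loss is mean((-theta*x + k_t*x)^2); its gradient in theta is
        # 2 * (theta - k_t) * mean(x^2).
        mean_sq = sum(x * x for x in states) / len(states)
        theta -= STEP_SIZE * 2.0 * (theta - k_t) * mean_sq
    return theta


learned_gain = train()
```

Although the learner never sees labels from the converged supervisor, its gain ends up close to `K_STAR` in this toy case, because the intermediate labels it tracks approach the final ones.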

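This paper addresses settings in which the supervisor may change while the policy is being learned, and formalizes imitation learning from a converging supervisor. The authors also connect this framework to a class of RL algorithms (DPI). In experiments, using a state-of-the-art deep model-based method as the supervisor yields better results than deep RL baselines on continuous control tasks and provides up to an 80-fold speedup in policy evaluation.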

On-Policy Robot Imitation Learning from a Converging Supervisor

Recently, a novel class of Approximate Policy Iteration (API) algorithms has
demonstrated impressive practical performance (e.g., ExIt from [2],
AlphaGo-Zero from [27]). This new family of algorithms maintains, and
alternately optimizes, two policies: a fast, reactive policy (e.g., a deep
neural network) deployed at test time, and a slow, non-reactive policy (e.g.,
Tree Search) that can plan multiple steps ahead. The reactive policy is
updated under supervision from the non-reactive policy, while the non-reactive
policy is improved with guidance from the reactive policy. In this work we
study this Dual Policy Iteration (DPI) strategy in an alternating optimization
framework and provide a convergence analysis that extends existing API theory.
We also develop a special instance of this framework which reduces the update
of non-reactive policies to model-based optimal control using learned local
models, and provides a theoretically sound way of unifying model-free and
model-based RL approaches with unknown dynamics. We demonstrate the efficacy of
our approach on various continuous control Markov Decision Processes.
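
The alternation between the two policies can be sketched on a small tabular problem. This is a simplified illustration, not the authors' algorithm: it assumes a hypothetical five-state chain with a known deterministic model (instead of learned local models), a depth-limited lookahead as the non-reactive policy, and direct copying as the imitation step. The lookahead uses rollouts of the reactive policy to value its leaves, and the reactive policy is then set to the lookahead's first action.

```python
# Hypothetical chain MDP: states 0..4, reward 1 on entering the goal state 4.
GAMMA = 0.9
N_STATES = 5
GOAL = 4
ACTIONS = (-1, +1)


def step(state, action):
    """Known deterministic model; the goal state is absorbing."""
    if state == GOAL:
        return state, 0.0
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def rollout_value(pi, state, horizon=20):
    """Discounted return of the reactive policy pi from a state."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if state == GOAL:
            break
        state, reward = step(state, pi[state])
        total += discount * reward
        discount *= GAMMA
    return total


def lookahead(pi, state, depth):
    """Non-reactive policy: depth-limited search with pi's rollouts at the leaves."""
    if state == GOAL or depth == 0:
        return rollout_value(pi, state), None
    best_value, best_action = float("-inf"), None
    for action in ACTIONS:
        nxt, reward = step(state, action)
        value = reward + GAMMA * lookahead(pi, nxt, depth - 1)[0]
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


def dual_policy_iteration(depth=2, max_iters=10):
    """Alternate: search guided by pi, then pi imitates the search."""
    pi = {s: -1 for s in range(N_STATES)}  # start from a poor reactive policy
    for _ in range(max_iters):
        eta = {s: lookahead(pi, s, depth)[1] for s in range(N_STATES) if s != GOAL}
        new_pi = {**pi, **eta}  # tabular "imitation" is a direct copy
        if new_pi == pi:
            break
        pi = new_pi
    return pi


final_pi = dual_policy_iteration()
```

With a lookahead depth of 2, a single search cannot reach the goal from the leftmost states, so the reactive policy improves over successive rounds as better leaf values feed back into the search.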

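This paper proposes the concept of Dual Policy Iteration and uses this framework to combine model-free and model-based RL methods under unknown dynamics, with applications to various continuous control Markov Decision Processes.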