We present a policy optimization framework in which the learned policy comes
with a machine-checkable certificate of adversarial robustness. Our approach,
called CAROL, learns a model of the environment. In each learning iteration, it
uses the current version of this model and an external abstract interpreter to
construct a differentiable signal for provable robustness. This signal is used
to guide policy learning, and the abstract interpretation used to construct it
directly leads to the robustness certificate returned at convergence. We give a
theoretical analysis that bounds the worst-case cumulative reward of CAROL.
We also experimentally evaluate CAROL on four MuJoCo environments. On these
tasks, which involve continuous state and action spaces, CAROL learns certified
policies that have performance comparable to the (non-certified) policies
learned using state-of-the-art robust RL methods.
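
As a rough illustration of how an abstract interpreter can bound a policy's output under bounded state perturbations, the sketch below propagates an interval (box) through a tiny two-layer ReLU policy. The interval domain, the network shape, the weights, and every function name are assumptions made for this example; the abstract does not say which abstract domain or architecture CAROL uses.

```python
# Hypothetical sketch: interval abstract interpretation of a tiny policy.
# The weights and names are illustrative assumptions, not CAROL's code.

def affine_interval(lo, hi, weights, bias):
    """Soundly propagate the box [lo, hi] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        l = h = b
        for w, xl, xh in zip(row, lo, hi):
            # A non-negative weight keeps the bound order; a negative one swaps it.
            if w >= 0:
                l += w * xl
                h += w * xh
            else:
                l += w * xh
                h += w * xl
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def relu_interval(lo, hi):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]

def policy_bounds(state, eps, W1, b1, W2, b2):
    """Bound the policy output over all states within eps (L-infinity) of `state`."""
    lo = [s - eps for s in state]
    hi = [s + eps for s in state]
    lo, hi = affine_interval(lo, hi, W1, b1)
    lo, hi = relu_interval(lo, hi)
    return affine_interval(lo, hi, W2, b2)

# Assumed toy weights for a 2-input, 2-hidden-unit, 1-output policy.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, -2.0]]
b2 = [0.1]

action_lo, action_hi = policy_bounds([1.0, 0.5], 0.1, W1, b1, W2, b2)
print(action_lo, action_hi)
```

For these assumed weights the output interval is roughly [-1.3, -0.5], which contains the unperturbed action of -0.9. Because each bound is built only from additions, multiplications, and max, the same computation can be written in an autodiff framework, which is one way a certified bound can become a differentiable training signal.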

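This paper presents CAROL, a policy optimization framework with provable robustness. While learning a model of the environment, CAROL uses an external abstract interpreter to construct a differentiable signal that guides policy learning and directly yields the robustness certificate returned at convergence. Experiments on four MuJoCo environments show that CAROL learns certified policies whose performance is comparable to the non-certified policies learned by state-of-the-art robust RL methods.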

Policy Optimization with Robustness Certificates

We present Revel, a partially neural reinforcement learning (RL) framework
for provably safe exploration in continuous state and action spaces. A key
challenge for provably safe deep RL is that repeatedly verifying neural
networks within a learning loop is computationally infeasible. We address this
challenge using two policy classes: a general, neurosymbolic class with
approximate gradients and a more restricted class of symbolic policies that
allows efficient verification. Our learning algorithm is a mirror descent over
policies: in each iteration, it safely lifts a symbolic policy into the
neurosymbolic space, performs safe gradient updates to the resulting policy,
and projects the updated policy into the safe symbolic subset, all without
requiring explicit verification of neural networks. Our empirical results show
that Revel enforces safe exploration in many scenarios in which Constrained
Policy Optimization does not, and that it can discover policies that outperform
those learned through prior approaches to verified exploration.
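
The lift, update, and project steps described above have the same shape as a projected-gradient loop. The toy below shows that shape with a single scalar gain. It is a sketch under strong assumptions, not Revel's algorithm: the "symbolic policy" is one gain `k`, the "neural" part is a single residual parameter, the safe set is an interval `[k_min, k_max]` assumed to have been verified elsewhere, and the loss is a hypothetical squared distance to a target gain. It also omits the safety mechanism Revel applies during the gradient updates themselves.

```python
# Hypothetical one-parameter toy of a lift / update / project loop.
# All names, the loss, and the safe interval are assumptions for illustration.

def lift(k):
    """Embed the symbolic gain in a larger class; the zero residual keeps behavior unchanged."""
    return {"k": k, "residual": 0.0}

def update(policy, target_gain, lr=0.25, steps=4):
    """Gradient steps on the residual for the loss (k + residual - target_gain)^2."""
    for _ in range(steps):
        gain = policy["k"] + policy["residual"]
        grad = 2.0 * (gain - target_gain)
        policy["residual"] -= lr * grad
    return policy

def project(policy, k_min, k_max):
    """Map back to the symbolic class by clipping the combined gain into the safe interval."""
    gain = policy["k"] + policy["residual"]
    return min(max(gain, k_min), k_max)

def train(k0, target_gain, k_min, k_max, iterations=10):
    k = k0
    for _ in range(iterations):
        k = project(update(lift(k), target_gain), k_min, k_max)
    return k

# The unconstrained optimum (3.0) lies outside the safe interval [0.0, 2.0],
# so the loop settles on the boundary of the safe set.
print(train(0.5, 3.0, 0.0, 2.0))
```

In this toy every iterate returned by `project` lies in the safe interval by construction, which is the property the projection step is meant to preserve; no check of the lifted policy is ever needed.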

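This paper proposes Revel, a partially neural reinforcement learning (RL) framework for provably safe exploration in continuous state and action spaces. It addresses the computational cost of verifying neural networks by using two policy classes and projecting the updated policy into a safe symbolic subset, which gives safe exploration without explicit verification of neural networks. Experiments show that Revel enforces safe exploration in many scenarios and discovers policies that outperform those learned through prior approaches to verified exploration.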

Neurosymbolic Reinforcement Learning with Formally Verified Exploration

Hierarchical agents have the potential to solve sequential decision making
tasks with greater sample efficiency than their non-hierarchical counterparts
because hierarchical agents can break down tasks into sets of subtasks that
only require short sequences of decisions. In order to realize this potential
of faster learning, hierarchical agents need to be able to learn their multiple
levels of policies in parallel so these simpler subproblems can be solved
simultaneously. Yet, learning multiple levels of policies in parallel is hard
because it is inherently unstable: changes in a policy at one level of the
hierarchy may cause changes in the transition and reward functions at higher
levels in the hierarchy, making it difficult to jointly learn multiple levels
of policies. In this paper, we introduce a new Hierarchical Reinforcement
Learning (HRL) framework, Hierarchical Actor-Critic (HAC), that can overcome
the instability issues that arise when agents try to jointly learn multiple
levels of policies. The main idea behind HAC is to train each level of the
hierarchy independently of the lower levels by training each level as if the
lower level policies are already optimal. We demonstrate experimentally in both
grid world and simulated robotics domains that our approach can significantly
accelerate learning relative to other non-hierarchical and hierarchical
methods. Indeed, our framework is the first to successfully learn 3-level
hierarchies in parallel in tasks with continuous state and action spaces.
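
One way to train a level "as if the lower level policies are already optimal" is to relabel the higher level's action in hindsight: if the lower level was asked to reach a subgoal but ended somewhere else, the stored transition records the state that was actually reached as the action taken. The sketch below builds such a transition. The tuple layout, the tolerance, the 0 / -1 sparse reward, and all names are assumptions made for this example rather than details stated in the abstract.

```python
# Hypothetical sketch of a hindsight-relabeled transition for a higher level.
# Field names, tolerance, and reward convention are illustrative assumptions.
from typing import NamedTuple, Tuple

State = Tuple[float, ...]

class Transition(NamedTuple):
    state: State
    action: State      # at a higher level, an action is a subgoal state
    reward: float
    next_state: State
    goal: State

def reached(state: State, goal: State, tol: float = 0.05) -> bool:
    """True if every coordinate of `state` is within `tol` of `goal`."""
    return all(abs(s - g) <= tol for s, g in zip(state, goal))

def hindsight_action_transition(state: State, proposed_subgoal: State,
                                achieved_state: State, goal: State) -> Transition:
    """Store the subgoal an optimal lower level would have been given to end here."""
    if reached(achieved_state, proposed_subgoal):
        action = proposed_subgoal
    else:
        # The lower level missed; pretend the achieved state was the subgoal all along.
        action = achieved_state
    reward = 0.0 if reached(achieved_state, goal) else -1.0
    return Transition(state, action, reward, achieved_state, goal)

t = hindsight_action_transition(
    state=(0.0, 0.0),
    proposed_subgoal=(1.0, 1.0),
    achieved_state=(0.4, 0.3),
    goal=(2.0, 2.0),
)
print(t.action, t.reward)
```

Because the stored action always matches where the lower level actually ended up, the higher level's training data no longer changes as the lower level improves, which addresses the non-stationarity described above.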

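This paper introduces a new Hierarchical Reinforcement Learning (HRL) framework, Hierarchical Actor-Critic (HAC), which overcomes the instability that arises when multiple levels of policies are learned simultaneously and successfully learns 3-level hierarchies in tasks with continuous state and action spaces.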