Safety is a primary concern when applying reinforcement learning to
real-world control tasks, especially in the presence of external disturbances.
However, existing safe reinforcement learning algorithms rarely account for
external disturbances, limiting their applicability and robustness in practice.
To address this challenge, this paper proposes a robust safe reinforcement
learning framework that tackles worst-case disturbances. First, this paper
presents a policy iteration scheme to solve for the robust invariant set, i.e.,
a subset of the safe set such that persistent safety is possible only for
states inside it. The key idea is to establish a two-player zero-sum game by leveraging
the safety value function in Hamilton-Jacobi reachability analysis, in which
the protagonist (i.e., control inputs) aims to maintain safety and the
adversary (i.e., external disturbances) tries to violate safety. This paper
proves that the proposed policy iteration algorithm converges monotonically to
the maximal robust invariant set. Second, this paper integrates the proposed
policy iteration scheme into a constrained reinforcement learning algorithm
that simultaneously synthesizes the robust invariant set and uses it for
constrained policy optimization. This algorithm tackles both optimality and
safety, i.e., learning a policy that attains high rewards while maintaining
safety under worst-case disturbances. Experiments on classic control tasks show
that the proposed method achieves zero constraint violation with learned
worst-case adversarial disturbances, while other baseline algorithms violate
the safety constraints substantially. Our proposed method also attains
performance comparable to that of the baselines even in the absence of the adversary.
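
The robust invariant set described above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's policy iteration: it swaps in a plain set-based fixed-point iteration on a hypothetical discrete double integrator, and the grid bounds, action sets, and function names are all assumptions made for illustration. It starts from the full safe set and repeatedly removes states from which no control keeps the next state inside the set for every disturbance, so the candidate set shrinks monotonically until it stops changing.

```python
from itertools import product

# Hypothetical toy problem (not from the paper): a discrete double integrator.
P_MAX, V_MAX = 4, 2                 # assumed safe box: |p| <= 4 and |v| <= 2
CONTROLS = (-2, -1, 0, 1, 2)        # protagonist (control) actions, assumed
DISTURBANCES = (-1, 0, 1)           # adversary (disturbance) actions, assumed


def step(state, u, d):
    """One step of the assumed dynamics: p' = p + v, v' = v + u + d."""
    p, v = state
    return (p + v, v + u + d)


def robust_invariant_set():
    """Return (safe_set, maximal robust invariant subset) for the toy system."""
    safe = set(product(range(-P_MAX, P_MAX + 1), range(-V_MAX, V_MAX + 1)))
    current = set(safe)
    while True:
        # Keep a state only if SOME control keeps the successor inside the
        # current set for EVERY disturbance (max over u of min over d).
        kept = {
            x for x in current
            if any(all(step(x, u, d) in current for d in DISTURBANCES)
                   for u in CONTROLS)
        }
        if kept == current:          # fixed point reached
            return safe, current
        current = kept               # the set can only shrink


if __name__ == "__main__":
    safe, invariant = robust_invariant_set()
    print(len(safe), len(invariant))
```

On this toy grid, a state such as (2, 2) lies inside the safe box but is pruned, because its velocity carries it to the boundary faster than the control can brake against the worst-case disturbance.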

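This paper proposes a robust safe reinforcement learning framework that addresses the safety problems caused by external disturbances when reinforcement learning is applied to real-world control tasks. The framework guarantees safety by constructing a robust invariant set and uses a constrained reinforcement learning algorithm for policy optimization.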

Robust Safe Reinforcement Learning under Adversarial Disturbances

Existing learning approaches to dexterous manipulation use demonstrations or
interactions with the environment to train black-box neural networks that
provide little control over how the robot learns the skills or how it would
perform after training. These approaches pose significant challenges when
implemented on physical platforms given that, during initial stages of
training, the robot's behavior could be erratic and potentially harmful to its
own hardware, the environment, or any humans in the vicinity. A potential way
to address these limitations is to add constraints during learning that
restrict and guide the robot's behavior during training as well as rollouts.
Inspired by the success of constrained approaches in other domains, we
investigate the effects of adding position-based constraints to a 24-DOF robot
hand learning to perform object relocation using Constrained Policy
Optimization. We find that a simple geometric constraint can ensure the robot
learns to move towards the object sooner than without constraints. Further,
training with this constraint requires a similar number of samples as its
unconstrained counterpart to master the skill. These findings shed light on how
simple constraints can help robots achieve sensible and safe behavior quickly
and ease concerns surrounding hardware deployment. We also investigate the
effects of the strictness of these constraints and report findings that provide
insights into how different degrees of strictness affect learning outcomes. Our
code is available at
this https URL.
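
To make the idea of a position-based constraint concrete, here is a minimal sketch of how such a constraint could be turned into the cost signal that a constrained method consumes. The abstract does not specify the geometry, so the corridor shape (a tube around the straight line from the hand's start position to the object), the radius, and every function name below are assumptions made for illustration only.

```python
import math


def distance_to_segment(point, a, b):
    """Euclidean distance from `point` to the line segment from `a` to `b`."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, point)]
    ab_len_sq = sum(c * c for c in ab)
    if ab_len_sq == 0.0:             # degenerate segment: a and b coincide
        return math.dist(point, a)
    # Projection parameter along the segment, clamped to [0, 1].
    s = sum(x * y for x, y in zip(ap, ab)) / ab_len_sq
    s = max(0.0, min(1.0, s))
    closest = [ai + s * c for ai, c in zip(a, ab)]
    return math.dist(point, closest)


def constraint_cost(palm_pos, start_pos, object_pos, radius=0.1):
    """Hypothetical indicator cost: 1 if the palm leaves the tube, else 0."""
    outside = distance_to_segment(palm_pos, start_pos, object_pos) > radius
    return 1.0 if outside else 0.0


def discounted_cost(costs, gamma=0.99):
    """Discounted sum of per-step costs, the quantity a cost limit bounds."""
    total = 0.0
    for c in reversed(costs):
        total = c + gamma * total
    return total
```

A constrained algorithm such as Constrained Policy Optimization would then keep the expected discounted cost below a chosen limit. In this sketch, shrinking `radius` or lowering that limit corresponds to a stricter constraint.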

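This paper studies a 24-DOF robot hand that learns to perform an object relocation task using Constrained Policy Optimization, and finds that adding constraints during learning ensures the robot reaches the target sooner, which gives it more robust and safer behavior.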

Constrained Reinforcement Learning for Dexterous Manipulation

Solving tasks in Reinforcement Learning is no easy feat. As the goal of the
agent is to maximize the accumulated reward, it often learns to exploit
loopholes and misspecifications in the reward signal, resulting in unwanted
behavior. While constraints may solve this issue, there is no closed-form
solution for general constraints. In this work we present a novel
multi-timescale approach for constrained policy optimization, called "Reward
Constrained Policy Optimization" (RCPO), which uses an alternative penalty
signal to guide the policy towards a constraint-satisfying one. We prove the
convergence of our approach and provide empirical evidence of its ability to
train constraint-satisfying policies.
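
The interplay between a penalty signal and multiple timescales can be shown on a deliberately tiny example. The sketch below is not the RCPO algorithm itself: it assumes a Lagrangian-style penalty (reward minus a multiplier times the cost), drops the critic entirely, and uses a made-up one-parameter problem with reward -(theta - 2)^2, cost theta, and cost threshold `alpha`. The function names and step sizes are likewise assumptions. The policy parameter is updated with a larger step size than the multiplier, so the multiplier changes on the slower timescale.

```python
def reward_grad(theta):
    """Gradient of the assumed reward -(theta - 2)^2."""
    return -2.0 * (theta - 2.0)


def cost(theta):
    """Assumed constraint cost; the constraint is cost(theta) <= alpha."""
    return theta


def cost_grad(theta):
    """Gradient of the assumed cost."""
    return 1.0


def rcpo_toy(alpha=1.0, lr_theta=0.1, lr_lambda=0.01, steps=5000):
    """Two-timescale penalized ascent on a toy problem (illustrative only)."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        violation = cost(theta) - alpha
        # Fast timescale: ascend the penalized objective reward - lam * cost.
        theta += lr_theta * (reward_grad(theta) - lam * cost_grad(theta))
        # Slow timescale: raise the penalty while the constraint is violated,
        # lower it otherwise, and keep it non-negative.
        lam = max(0.0, lam + lr_lambda * violation)
    return theta, lam


if __name__ == "__main__":
    theta, lam = rcpo_toy()
    print(theta, lam)
```

For this toy problem the constrained optimum is theta = 1 with multiplier 2, and the iteration settles close to that point. Without the penalty, theta would instead drift to the unconstrained optimum at 2 and violate the threshold.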

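The paper proposes a multi-timescale approach called "Reward Constrained Policy Optimization" (RCPO), which uses an alternative penalty signal to guide the policy towards satisfying the constraints, and it establishes the approach's convergence and its ability to train constraint-satisfying policies.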