In robotic control tasks, policies trained by reinforcement learning (RL) in
simulation often experience a performance drop when deployed on physical
hardware, due to modeling error, measurement error, and unpredictable
perturbations in the real world. Robust RL methods account for this issue by
approximating a worst-case value function during training, but they can be
sensitive to approximation errors in the value function and its gradient before
training is complete. In this paper, we hypothesize that Lipschitz
regularization can help condition the approximated value function gradients,
leading to improved robustness after training. We test this hypothesis by
combining Lipschitz regularization with an application of the Fast Gradient Sign
Method to reduce approximation errors when evaluating the value function under
adversarial perturbations. Our empirical results demonstrate the benefits of
this approach over prior work on a number of continuous control benchmarks.
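
The two ingredients named in this abstract can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: the quadratic `value` function, the `weights`, the `epsilon` step size, and the hinge-squared gradient penalty are all assumptions chosen for illustration, and the abstract does not say how the authors actually combine the two pieces. The sketch shows one FGSM step that perturbs a state in the direction that lowers the value estimate, and one common way to penalize a value gradient whose norm exceeds a target Lipschitz bound.

```python
import math

def value(state, weights):
    """Toy value estimate V(s) = -sum_i w_i * s_i^2 (hypothetical stand-in for a critic)."""
    return -sum(w * s * s for w, s in zip(weights, state))

def value_grad(state, weights):
    """Analytic gradient of the toy value function with respect to the state."""
    return [-2.0 * w * s for w, s in zip(weights, state)]

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def fgsm_perturb(state, weights, epsilon):
    """One FGSM step that lowers the value: s_adv = s - epsilon * sign(grad_s V(s))."""
    grad = value_grad(state, weights)
    return [s - epsilon * sign(g) for s, g in zip(state, grad)]

def lipschitz_penalty(state, weights, lipschitz_bound):
    """Hinge-squared penalty on the gradient norm: max(0, ||grad V||_2 - L)^2.

    This gradient penalty is an assumed regularizer, one of several ways
    to encourage a Lipschitz-bounded value function.
    """
    grad_norm = math.sqrt(sum(g * g for g in value_grad(state, weights)))
    return max(0.0, grad_norm - lipschitz_bound) ** 2

# Illustrative inputs (assumptions, not taken from the paper).
state = [1.0, -2.0]
weights = [0.5, 0.25]
adv_state = fgsm_perturb(state, weights, epsilon=0.1)

print("clean value:      ", value(state, weights))
print("adversarial value:", value(adv_state, weights))
print("penalty (L = 1):  ", lipschitz_penalty(state, weights, 1.0))
```

Because FGSM relies on a first-order approximation of the value function, its accuracy depends on how well behaved the value gradient is, which is the connection to gradient conditioning that the abstract hypothesizes.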

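In robotic control tasks, policies trained by reinforcement learning (RL) in simulation often suffer a performance drop when deployed on physical hardware. This paper studies whether Lipschitz regularization can better condition the gradients of the approximated value function and thereby improve robustness after training. By combining Lipschitz regularization with the Fast Gradient Sign Method, the experiments show the advantages of this approach on several continuous control benchmarks.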

Explicit Lipschitz Value Estimation Enhances Policy Robustness Against Perturbation

Reinforcement learning (RL) agents are vulnerable to adversarial
disturbances, which can deteriorate task performance or compromise safety
specifications. Existing methods either address safety requirements under the
assumption of no adversary (e.g., safe RL) or only focus on robustness against
performance adversaries (e.g., robust RL). Learning one policy that is both
safe and robust remains a challenging open problem. The difficulty is how to
tackle two intertwined aspects in the worst cases: feasibility and optimality.
Optimality is only valid inside a feasible region, while identification of
the maximal feasible region must rely on learning the optimal policy. To address
this issue, we propose a systematic framework to unify safe RL and robust RL,
including problem formulation, iteration scheme, convergence analysis and
practical algorithm design. This unification is built upon constrained
two-player zero-sum Markov games. A dual policy iteration scheme is proposed,
which simultaneously optimizes a task policy and a safety policy. The
convergence of this iteration scheme is proved. Furthermore, we design a deep
RL algorithm for practical implementation, called dually robust actor-critic
(DRAC). Evaluations on safety-critical benchmarks demonstrate that DRAC
achieves high performance and persistent safety under all scenarios (no
adversary, safety adversary, performance adversary), outperforming all
baselines significantly.

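This paper proposes a systematic framework to unify safe RL and robust RL, covering problem formulation, iteration scheme, convergence analysis, and practical algorithm design. The framework is built upon constrained two-player zero-sum Markov games and introduces a dual policy iteration scheme that simultaneously optimizes a task policy and a safety policy. The convergence of this iteration scheme is proved. In addition, a deep RL algorithm for practical implementation, called DRAC, is designed. Evaluations on safety-critical benchmarks show that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary) and significantly outperforms all baselines.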

Safe Reinforcement Learning with Dual Robustness

This thesis rigorously studies fundamental reinforcement learning (RL)
methods under modern practical considerations, including robust RL,
distributional RL, and offline RL with neural function approximation. The
thesis first gives readers an overview of RL and the key technical
background in statistics and optimization. In each of the settings, the thesis
motivates the problems to be studied, reviews the current literature, provides
computationally efficient algorithms with provable efficiency guarantees, and
concludes with future research directions. The thesis makes fundamental
contributions to the three settings above, algorithmically, theoretically,
and empirically, while staying relevant to practical considerations.

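This thesis studies several reinforcement learning methods, including robust RL, distributional RL, and offline RL, and provides algorithms and future research directions for each setting.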