Online safe reinforcement learning (RL) involves training a policy that
maximizes task efficiency while satisfying constraints by interacting with the
environment. In this paper, we focus on the challenges of solving
multi-constraint (MC) safe RL problems. We
approach the safe RL problem from the perspective of Multi-Objective
Optimization (MOO) and propose a unified framework designed for MC safe RL
algorithms. This framework highlights the manipulation of gradients derived
from constraints. Leveraging insights from this framework and recognizing the
significance of \textit{redundant} and \textit{conflicting} constraint
conditions, we introduce the Gradient Shaping (GradS) method for general
Lagrangian-based safe RL algorithms to improve the training efficiency in terms
of both reward and constraint satisfaction. Extensive experiments show that the
proposed method encourages exploration and learns policies that improve both
safety and reward performance across a range of challenging MC safe RL tasks,
and that it scales well with the number of constraints.
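
For orientation, the generic multi-constraint Lagrangian that Lagrangian-based safe RL methods build on can be written as below. The notation ($J_r$ for expected reward, $J_{c_i}$ and $d_i$ for the $i$-th expected cost and its threshold, $\lambda_i$ for the multipliers, $m$ for the number of constraints, $\theta$ for the policy parameters) is assumed here and does not come from the abstract. This is the standard formulation, not the GradS update itself.

$$
\max_{\pi} \; J_r(\pi) \quad \text{s.t.} \quad J_{c_i}(\pi) \le d_i, \qquad i = 1, \dots, m
$$

$$
\mathcal{L}(\pi, \lambda) = J_r(\pi) - \sum_{i=1}^{m} \lambda_i \bigl( J_{c_i}(\pi) - d_i \bigr), \qquad \lambda_i \ge 0
$$

$$
\nabla_\theta \mathcal{L}(\pi_\theta, \lambda) = \nabla_\theta J_r(\pi_\theta) - \sum_{i=1}^{m} \lambda_i \, \nabla_\theta J_{c_i}(\pi_\theta)
$$

In this form each constraint contributes its own gradient term weighted by its multiplier, so redundant or conflicting constraints show up directly in the policy update direction.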

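This work uses a unified multi-objective optimization (MOO) framework to address complex multi-constraint (MC) safe reinforcement learning (safe RL) problems. By manipulating the gradients of the constraints, it introduces the Gradient Shaping (GradS) method to improve training efficiency. Experiments show that the method improves exploration and the learned policy across a range of challenging MC safe RL tasks, and that it scales well with the number of constraints.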

Gradient Shaping for Multi-Constraint Safe Reinforcement Learning

We examine online safe multi-agent reinforcement learning using constrained
Markov games in which agents compete by maximizing their expected total rewards
under a constraint on expected total utilities. Our focus is confined to an
episodic two-player zero-sum constrained Markov game with independent
transition functions that are unknown to agents, adversarial reward functions,
and stochastic utility functions. For such a Markov game, we employ an approach
based on the occupancy measure to formulate it as an online constrained
saddle-point problem with an explicit constraint. We extend the Lagrange
multiplier method in constrained optimization to handle the constraint by
creating a generalized Lagrangian with minimax decision primal variables and a
dual variable. Next, we develop an upper confidence reinforcement learning
algorithm to solve this Lagrangian problem while balancing exploration and
exploitation. Our algorithm updates the minimax decision primal variables via
online mirror descent and the dual variable via a projected gradient step, and we
prove that it enjoys a sublinear rate $O((|X|+|Y|) L \sqrt{T(|A|+|B|)})$ for
both regret and constraint violation after playing $T$ episodes of the game.
Here, $L$ is the horizon of each episode, and $(|X|,|A|)$ and $(|Y|,|B|)$ are the
state/action space sizes of the min-player and the max-player, respectively. To
the best of our knowledge, we provide the first provably efficient online safe
reinforcement learning algorithm in constrained Markov games.
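
The two update rules named above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's algorithm: it replaces occupancy measures and confidence sets with a small constrained matrix game, and every name in it (`omd_step`, `dual_step`, the matrices `R` and `G`, the threshold `b`, the step sizes, and the cap `lam_max`) is an assumption made for illustration. The min-player and max-player strategies are updated by entropic online mirror descent on the simplex, and the multiplier by a projected gradient step.

```python
import numpy as np


def omd_step(p, grad, lr, minimize=True):
    """One entropic mirror-descent (multiplicative-weights) step on the simplex."""
    sign = -1.0 if minimize else 1.0
    logits = np.log(p) + sign * lr * grad
    logits -= logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def dual_step(lam, violation, lr, lam_max):
    """Gradient ascent on the multiplier, projected onto [0, lam_max]."""
    return float(np.clip(lam + lr * violation, 0.0, lam_max))


# Toy constrained matrix game (all values are illustrative assumptions).
rng = np.random.default_rng(0)
R = rng.uniform(size=(3, 4))  # loss matrix: x minimizes x^T R y, y maximizes it
G = rng.uniform(size=(3, 4))  # constraint cost: want x^T G y <= b
b = 0.5

x = np.full(3, 1.0 / 3.0)  # min-player mixed strategy
y = np.full(4, 1.0 / 4.0)  # max-player mixed strategy
lam = 0.0                  # dual variable

for _ in range(200):
    M = R + lam * G              # Lagrangian payoff: x^T (R + lam G) y - lam b
    grad_x = M @ y               # gradient of the Lagrangian w.r.t. x
    grad_y = M.T @ x             # gradient of the Lagrangian w.r.t. y
    violation = x @ G @ y - b    # positive when the constraint is violated
    x = omd_step(x, grad_x, lr=0.1, minimize=True)
    y = omd_step(y, grad_y, lr=0.1, minimize=False)
    lam = dual_step(lam, violation, lr=0.1, lam_max=10.0)
```

The entropic mirror map keeps both strategies on the probability simplex without an explicit projection, while clipping keeps the multiplier nonnegative and bounded.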

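This paper proposes an online safe reinforcement learning algorithm for constrained Markov games, based on Lagrangian optimization over occupancy measures. By updating the minimax decision primal variables and the dual variable, it achieves sublinear regret and constraint violation, enabling efficient learning in Markov games.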