Reinforcement Learning (RL) algorithms have shown tremendous success in
simulation environments, but their application to real-world problems faces
significant challenges, with safety being a major concern. In particular,
enforcing state-wise constraints is essential for many challenging tasks such
as autonomous driving and robot manipulation. However, existing safe RL
algorithms under the framework of Constrained Markov Decision Process (CMDP) do
not consider state-wise constraints. To address this gap, we propose State-wise
Constrained Policy Optimization (SCPO), the first general-purpose policy search
algorithm for state-wise constrained reinforcement learning. SCPO provides
guarantees for state-wise constraint satisfaction in expectation. In
particular, we introduce the framework of Maximum Markov Decision Process, and
prove that the worst-case safety violation is bounded under SCPO. We
demonstrate the effectiveness of our approach in training neural network
policies for a broad set of robot locomotion tasks, where the agent must satisfy a
variety of state-wise safety constraints. Our results show that SCPO
significantly outperforms existing methods and can handle state-wise
constraints in high-dimensional robotics tasks.
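
To make the distinction in the abstract concrete, the following is a minimal sketch of how a cumulative (CMDP-style) constraint and a state-wise constraint can disagree on the same trajectory. The function names, the cost values, and the thresholds `budget` and `limit` are hypothetical and chosen only for illustration. The running-maximum helper is one simple way to track the worst per-step cost incrementally; whether it matches the paper's Maximum Markov Decision Process construction is an assumption, not something the abstract states.

```python
from typing import List, Sequence


def satisfies_cumulative(costs: Sequence[float], budget: float) -> bool:
    """CMDP-style check: the summed cost of the trajectory stays within a budget."""
    return sum(costs) <= budget


def satisfies_state_wise(costs: Sequence[float], limit: float) -> bool:
    """State-wise check: every single step must stay within the limit."""
    return all(c <= limit for c in costs)


def running_max_increments(costs: Sequence[float]) -> List[float]:
    """Increments of the running maximum cost, starting from 0.

    The increments sum to max(0, max(costs)), so a bound on their sum
    is a bound on the worst per-step cost (illustrative assumption).
    """
    increments = []
    current_max = 0.0
    for c in costs:
        step = max(c - current_max, 0.0)
        increments.append(step)
        current_max += step
    return increments


# Hypothetical per-step costs for one trajectory.
costs = [0.0, 0.25, 0.75, 0.0]

print(satisfies_cumulative(costs, budget=1.5))  # True: total cost is 1.0
print(satisfies_state_wise(costs, limit=0.5))   # False: one step costs 0.75
print(sum(running_max_increments(costs)))       # 0.75, the worst step cost
```

In this toy trajectory the cumulative check passes while the state-wise check fails, which is the gap the abstract attributes to existing CMDP-based methods.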

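State-wise Constrained Policy Optimization (SCPO) is the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. By introducing the Maximum Markov Decision Process framework, it guarantees state-wise constraint satisfaction in expectation, and experiments on high-dimensional robotics tasks show that SCPO significantly outperforms existing methods.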

State-wise Constrained Policy Optimization

Despite the tremendous success of Reinforcement Learning (RL) algorithms in
simulation environments, applying RL to real-world applications still faces
many challenges. A major concern is safety, in other words, constraint
satisfaction. State-wise constraints are among the most common constraints in
real-world applications and among the most challenging in Safe RL.
Enforcing state-wise constraints is essential to many challenging
tasks such as autonomous driving and robot manipulation. This paper provides a
comprehensive review of existing approaches that address state-wise constraints
in RL. Under the framework of the State-wise Constrained Markov Decision Process
(SCMDP), we discuss the connections, differences, and trade-offs of
existing approaches in terms of (i) safety guarantee and scalability, (ii)
safety and reward performance, and (iii) safety after convergence and during
training. We also summarize limitations of current methods and discuss
potential future directions.

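This paper surveys existing approaches to state-wise constraints in reinforcement learning, compares their differences and trade-offs in safety, scalability, and reward performance, summarizes the limitations of current methods, and discusses future research directions.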