Reinforcement Learning (RL) algorithms have shown tremendous success in
simulation environments, but their application to real-world problems faces
significant challenges, with safety being a major concern. In particular,
enforcing state-wise constraints is essential for many challenging tasks such
as autonomous driving and robot manipulation. However, existing safe RL
algorithms under the framework of Constrained Markov Decision Process (CMDP) do
not consider state-wise constraints. To address this gap, we propose State-wise
Constrained Policy Optimization (SCPO), the first general-purpose policy search
algorithm for state-wise constrained reinforcement learning. SCPO provides
guarantees for state-wise constraint satisfaction in expectation. In
particular, we introduce the framework of Maximum Markov Decision Process, and
prove that the worst-case safety violation is bounded under SCPO. We
demonstrate the effectiveness of our approach on training neural network
policies for extensive robot locomotion tasks, where the agent must satisfy a
variety of state-wise safety constraints. Our results show that SCPO
significantly outperforms existing methods and can handle state-wise
constraints in high-dimensional robotics tasks.
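
To make the distinction between the two constraint types concrete, here is a minimal sketch. It assumes a trajectory is summarized by a list of per-step costs; the function names, the `budget` and `per_step_limit` thresholds, and the example numbers are all hypothetical and are not taken from the paper. A CMDP-style constraint bounds the cumulative cost, while a state-wise constraint bounds the cost at every step.

```python
from typing import Sequence


def satisfies_cumulative(costs: Sequence[float], budget: float) -> bool:
    """CMDP-style check: the total cost over the trajectory stays within a budget."""
    return sum(costs) <= budget


def satisfies_statewise(costs: Sequence[float], per_step_limit: float) -> bool:
    """State-wise check: the cost at every single step stays within a limit.

    Equivalent to bounding the maximum per-step cost along the trajectory.
    """
    return max(costs, default=0.0) <= per_step_limit


# Hypothetical per-step costs for one trajectory (illustrative values only).
costs = [0.0, 0.25, 0.75, 0.0]

cumulative_ok = satisfies_cumulative(costs, budget=1.5)
statewise_ok = satisfies_statewise(costs, per_step_limit=0.5)
print(cumulative_ok, statewise_ok)
```

In this example the trajectory stays within the cumulative budget but violates the per-step limit at the third step, which is the kind of violation a cumulative constraint alone cannot rule out.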

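State-wise Constrained Policy Optimization (SCPO) is the first general-purpose
policy search algorithm for state-wise constrained reinforcement learning. By
introducing the framework of the Maximum Markov Decision Process, it guarantees
state-wise constraint satisfaction in expectation, and experiments on
high-dimensional robotics tasks show that SCPO significantly outperforms
existing methods.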

State-wise Constrained Policy Optimization

Recent studies have shown that episodic reinforcement learning (RL) is no
harder than bandits when the total reward is bounded by $1$, and proved regret
bounds that have a polylogarithmic dependence on the planning horizon $H$.
However, it remains an open question whether such results can be carried over
to adversarial RL, where the reward is adversarially chosen in each episode. In
this paper, we answer this question affirmatively by proposing the first
horizon-free policy search algorithm. To tackle the challenges caused by
exploration and adversarially chosen reward, our algorithm employs (1) a
variance-uncertainty-aware weighted least square estimator for the transition
kernel; and (2) an occupancy measure-based technique for the online search of a
\emph{stochastic} policy. We show that our algorithm achieves an
$\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret
with full-information feedback, where $d$ is the dimension of a known feature
mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is
the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the
cardinalities of the state and action spaces. We also provide hardness results
and regret lower bounds to justify the near optimality of our algorithm and the
unavoidability of $\log|\mathcal{S}|$ and $\log|\mathcal{A}|$ in the regret
bound.
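
As an illustration of what an occupancy measure is, the sketch below computes it for a small tabular MDP under a stationary stochastic policy. This is the standard textbook definition, not the paper's algorithm, and every name and number in it (`occupancy_measures`, `expected_return`, the two-state MDP) is a hypothetical choice made for the example. The useful property is that the expected total reward is linear in the occupancy measure, so searching over policies can be treated as searching over these measures.

```python
from typing import List

Transition = List[List[List[float]]]  # P[s][a][s2] = Pr(next state s2 | s, a)
Policy = List[List[float]]            # pi[s][a]    = Pr(action a | state s)


def occupancy_measures(P: Transition, pi: Policy, s0: int, H: int) -> List[List[List[float]]]:
    """Return q with q[h][s][a] = Pr(s_h = s, a_h = a) when starting from s0."""
    n_states, n_actions = len(P), len(P[0])
    state_dist = [0.0] * n_states
    state_dist[s0] = 1.0
    q = []
    for _ in range(H):
        # Joint probability of being in state s and taking action a at this step.
        q_h = [[state_dist[s] * pi[s][a] for a in range(n_actions)]
               for s in range(n_states)]
        q.append(q_h)
        # Push the joint distribution through the transition kernel.
        next_dist = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                for s2 in range(n_states):
                    next_dist[s2] += q_h[s][a] * P[s][a][s2]
        state_dist = next_dist
    return q


def expected_return(q: List[List[List[float]]], r: List[List[float]]) -> float:
    """Expected total reward, a linear function of the occupancy measure."""
    return sum(q_h[s][a] * r[s][a]
               for q_h in q
               for s in range(len(r))
               for a in range(len(r[0])))


# Hypothetical 2-state, 2-action MDP: action a moves deterministically to state a.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]   # uniform stochastic policy
r = [[0.0, 0.0], [0.5, 0.5]]    # reward 0.5 in state 1, so the total over H = 2 is at most 1

q = occupancy_measures(P, pi, s0=0, H=2)
value = expected_return(q, r)
print(value)
```

With a full-information adversary, the reward table `r` would change from episode to episode while `q` depends only on the policy and the transition kernel.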

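This paper proposes the first horizon-free policy search algorithm for
adversarial reinforcement learning. To address the challenges posed by
exploration and adversarially chosen rewards, it uses a
variance-uncertainty-aware weighted least-squares estimator and an occupancy
measure-based online search technique. The algorithm achieves a regret of
$\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ under
full-information feedback, where $d$ is the dimension of the known feature
mapping that linearly parametrizes the unknown transition kernel, $K$ is the
number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the
cardinalities of the state and action spaces.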