Safe reinforcement learning (RL) aims to solve an optimal control problem
under safety constraints. Existing $\textit{direct}$ safe RL methods use the
original constraint throughout the learning process. They either lack
theoretical guarantees for the policies produced during iteration or suffer from
infeasibility problems. To address these issues, we propose an
$\textit{indirect}$ safe RL method called feasible policy iteration (FPI) that
iteratively uses the feasible region of the last policy to constrain the
current policy. The feasible region is represented by a feasibility function
called the constraint decay function (CDF). The core of FPI is a region-wise policy
update rule called feasible policy improvement, which maximizes the return
under the constraint of the CDF inside the feasible region and minimizes the
CDF outside the feasible region. This update rule is always feasible and
ensures that the feasible region monotonically expands and the state-value
function monotonically increases inside the feasible region. Using the feasible
Bellman equation, we prove that FPI converges to the maximum feasible region
and the optimal state-value function. Experiments on classic control tasks and
Safety Gym show that our algorithms achieve lower constraint violations and
comparable or higher performance than the baselines.
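
The region-wise update rule described above can be illustrated with a minimal
tabular sketch. It is not the authors' implementation. The function name, the
table layout, and the threshold `eps` are hypothetical, and the sketch assumes
that the CDF is zero in feasible states and positive elsewhere, a convention
the abstract does not state.

```python
def feasible_policy_improvement(policy, q_values, cdf_values, eps=1e-6):
    """One region-wise greedy update for a tabular, deterministic policy.

    policy[s]        -> action currently chosen in state s
    q_values[s][a]   -> estimated return of taking a in s, then following policy
    cdf_values[s][a] -> estimated CDF of taking a in s, then following policy
                        (assumed convention: <= eps means feasible)
    """
    new_policy = []
    for s, current_action in enumerate(policy):
        cdf_row = cdf_values[s]
        actions = range(len(cdf_row))
        if cdf_row[current_action] <= eps:
            # Inside the feasible region: maximize return over the actions
            # that keep the CDF within the threshold. The current action is
            # always admissible, so this branch never has an empty choice.
            admissible = [a for a in actions if cdf_row[a] <= eps]
            best = max(admissible, key=lambda a: q_values[s][a])
        else:
            # Outside the feasible region: ignore return and minimize the CDF.
            best = min(actions, key=lambda a: cdf_row[a])
        new_policy.append(best)
    return new_policy


# Hypothetical 3-state, 3-action example.
policy = [0, 0, 1]
q_values = [[1.0, 5.0, 3.0], [2.0, 9.0, 4.0], [0.0, 1.0, 2.0]]
cdf_values = [[0.0, 0.8, 0.0], [0.0, 0.0, 0.5], [0.9, 0.6, 0.3]]

new_policy = feasible_policy_improvement(policy, q_values, cdf_values)
print(new_policy)
```

With these made-up tables the update returns `[2, 1, 2]`: in state 0 the
highest-return action is rejected because its CDF exceeds the threshold, in
state 1 the highest-return action is admissible and is chosen, and in state 2,
which is infeasible under the current policy, the action with the smallest CDF
is chosen regardless of return.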

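This paper studies the safe reinforcement learning problem and proposes an
indirect safe reinforcement learning method called feasible policy iteration.
The algorithm represents the feasible region with a feasibility function called
the constraint decay function, which guarantees constraint satisfaction and
feasibility of the policy while achieving the optimization objective.
Experiments show that feasible policy iteration achieves better performance on
classic control tasks and safety scenarios.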