Safe reinforcement learning has been a promising approach for optimizing the
policy of an agent that operates in safety-critical applications. In this
paper, we propose an algorithm, SNO-MDP, that explores and optimizes Markov
decision processes under unknown safety constraints. Specifically, we take a
stepwise approach to optimizing safety and cumulative reward. In our method,
the agent first learns safety constraints by expanding the safe region, and
then optimizes the cumulative reward in the certified safe region. We provide
theoretical guarantees on both the satisfaction of the safety constraint and
the near-optimality of the cumulative reward under proper regularity
assumptions. We demonstrate the effectiveness of SNO-MDP through two
experiments: one uses synthetic data in a new, openly available environment
named GP-SAFETY-GYM, and the other simulates Mars surface exploration using
real observation data.

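This work proposes an algorithm named SNO-MDP that explores and optimizes Markov decision processes under unknown safety constraints: it learns the safety constraints by expanding the safe region, and then optimizes the cumulative reward within the certified safe region. Two experiments demonstrate the effectiveness of the algorithm.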
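
The two-phase idea summarized above (first expand a certified safe region, then optimize reward inside it) can be illustrated with a minimal Python sketch. This is not the SNO-MDP implementation. It assumes a deterministic one-dimensional chain of states, noise-free safety observations, and a known Lipschitz constant in place of a learned statistical model of the safety function. The names `expand_safe_region` and `optimize_reward`, and all numbers in the demo, are hypothetical.

```python
def expand_safe_region(g, h, lipschitz, start):
    """Phase 1 (sketch): grow a certified safe set from a known-safe start.

    g         -- true safety values, only read at states the agent visits
    h         -- safety threshold; a state s is safe when g[s] >= h
    lipschitz -- assumed Lipschitz constant of g w.r.t. |s - t|
    """
    n = len(g)
    safe = {start}      # states certified safe so far
    observed = set()    # states whose safety value has been measured
    while True:
        frontier = safe - observed
        if not frontier:
            break
        for s in frontier:
            observed.add(s)  # visit s and measure g[s] (noise-free assumption)
            for t in range(n):
                # Lipschitz lower bound: g[t] >= g[s] - lipschitz * |s - t|
                if g[s] - lipschitz * abs(s - t) >= h:
                    safe.add(t)
    return safe


def optimize_reward(r, safe, gamma=0.9, iters=200):
    """Phase 2 (sketch): value iteration restricted to the certified safe set.

    Actions are move left, stay, move right; a move is allowed only if it
    lands in `safe`. The reward r[t] is collected on entering state t.
    """
    v = [0.0] * len(r)

    def successors(s):
        return [s + a for a in (-1, 0, 1) if (s + a) in safe]

    for _ in range(iters):
        for s in safe:
            v[s] = max(r[t] + gamma * v[t] for t in successors(s))
    # Greedy policy: map each safe state to its best safe successor.
    policy = {
        s: max(successors(s), key=lambda t: r[t] + gamma * v[t]) for s in safe
    }
    return v, policy


if __name__ == "__main__":
    # Hypothetical chain: state 4 is unsafe, state 5 is safe but lies beyond it.
    g = [0.9, 0.8, 0.6, 0.35, 0.1, 0.3]
    r = [0, 0, 0, 1, 0, 10]
    safe = expand_safe_region(g, h=0.3, lipschitz=0.25, start=0)
    v, policy = optimize_reward(r, safe)
    print(sorted(safe), policy)
```

In this toy setting the agent never certifies state 5, even though it carries the largest reward, because the only route to it passes through the unsafe state 4. The reward is therefore optimized only over the states that phase 1 certified, which mirrors the stepwise structure described in the abstract.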