Safety is an indispensable requirement for applying reinforcement learning
(RL) to real problems. Although there has been a surge of safe RL algorithms
proposed in recent years, most existing work typically 1) relies on receiving
numeric safety feedback; 2) does not guarantee safety during the learning
process; 3) limits the problem to a priori known, deterministic transition
dynamics; and/or 4) assume the existence of a known safe policy for any states.
Addressing the issues mentioned above, we thus propose Long-term Binaryfeedback
Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision
processes (CMDPs) with binary safety feedback and an unknown, stochastic state
transition function. LoBiSaRL optimizes a policy to maximize rewards while
guaranteeing a long-term safety that an agent executes only safe state-action
pairs throughout each episode with high probability. Specifically, LoBiSaRL
models the binary safety function via a generalized linear model (GLM) and
conservatively takes only a safe action at every time step while inferring its
effect on future safety under proper assumptions. Our theoretical results show
that LoBiSaRL guarantees the long-term safety constraint, with high
probability. Finally, our empirical results demonstrate that our algorithm is
safer than existing methods without significantly compromising performance in
terms of reward.

LoBiSaRL 是一种安全的强化学习算法，应用于有约束的马尔科夫决策过程中，通过二进制安全反馈和未知的随机状态转移函数来保证长期安全约束。