Reinforcement Learning (RL) for control has become increasingly popular due
to its ability to learn rich feedback policies that take into account
uncertainty and complex representations of the environment. Safety constraints
are commonly handled with constrained optimization approaches, in which agents
are penalized for constraint violations. In such methods, if
agents are initialized in, or must visit, states where constraint violation
might be inevitable, it is unclear how much they should be penalized. We
address this challenge by formulating a constraint on the counterfactual harm
of the learned policy compared to a default, safe policy. In a philosophical
sense, this formulation penalizes the learner only for constraint violations
that it caused; in a practical sense, it maintains feasibility of the optimal
control problem. We present simulation studies on a rover with uncertain road
friction and a tractor-trailer parking environment that demonstrate our
constraint formulation enables agents to learn safer policies than contemporary
constrained RL methods.

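By comparing the learned policy against a default, safe policy, we propose a method that constrains counterfactual harm, enabling agents to learn safer policies while accounting for uncertainty and complex representations of the environment.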