Safe Policy Improvement (SPI) aims to provide provable guarantees that a learned
policy is at least approximately as good as a given baseline policy. Building
on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we
identify theoretical issues in their approach, provide a corrected theory, and
derive a new algorithm that is provably safe on finite Markov Decision
Processes (MDPs). Additionally, we provide a heuristic algorithm that exhibits
the best performance among many state-of-the-art SPI algorithms on two
different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms
and empirically show an interesting property of two classes of SPI algorithms:
while the mean performance of algorithms that incorporate the uncertainty as a
penalty on the action-value is higher, actively restricting the set of policies
more consistently produces good policies and is, thus, safer.

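Summary: The paper introduces a new algorithm with provable safety guarantees on finite Markov Decision Processes, along with a heuristic algorithm that achieves the best performance on two benchmarks. It also proposes a taxonomy of SPI algorithms and finds that algorithms that actively restrict the set of policies are safer.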