The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter evaluation stage, e.g, $n$-step and trace-based returns, have been analyzed in previous works. However, the case of multiple-step lookahead policy improvement, despite the recent increase in empirical evidence of its strength, has to our knowledge not been carefully analyzed yet. In this work, we introduce the first such analysis. Namely, we formulate variants of multiple-step policy improvement, derive new algorithms using these definitions and prove their convergence. Moreover, we show that recent prominent Reinforcement Learning algorithms are, in fact, instances of our framework. We thus shed light on their empirical success and give a recipe for deriving new algorithms for future study.

本文研究了改进策略和评估策略之间交替的着名Policy Iteration算法，以及其变体中多步向前的政策改进，形成了多步政策改进的变量，导出了新的算法并证明了其收敛性。此外，文章还展示了近期著名的强化学习算法实际上是我们框架的实例，阐明了它们的经验成功，为未来研究提供了推导新算法的方法。

强化学习中超越单步贪心方法