Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains both policy iteration and value iteration, the two celebrated DP methods. Despite its generality, MPI has not been thoroughly studied, especially its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three approximate MPI (AMPI) algorithms that are extensions of well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide an error propagation analysis for AMPI that unifies those for approximate policy and value iteration. We also provide a finite-sample analysis for the classification-based implementation of AMPI (CBMPI), which is more general than (and in some sense contains) the analysis of the other presented AMPI algorithms. An interesting observation is that MPI's parameter allows us to control the balance between the errors (in value function approximation and in estimating the greedy policy) in the final performance of the CBMPI algorithm.

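This paper studies the approximate form of the Modified Policy Iteration (MPI) algorithm. It proposes three DP algorithms suited to large state and action spaces, extending fitted-value iteration, fitted-Q iteration, and classification-based policy iteration, and provides a unified error propagation analysis. For the classification-based implementation, it also develops a finite-sample analysis showing how MPI's main parameter controls the balance between the classifier's estimation error and the overall value function approximation error.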
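
To make the role of MPI's parameter concrete, here is a minimal sketch of exact (tabular) MPI, not of the approximate algorithms proposed in the paper. The parameter name `m`, the array layouts of `P` and `R`, and the two-state toy MDP are assumptions introduced purely for illustration. Each iteration takes a greedy step with respect to the current value estimate and then applies the Bellman operator of the greedy policy `m` times; `m = 1` corresponds to value iteration, and letting `m` grow approaches policy iteration.

```python
def modified_policy_iteration(P, R, gamma, m, iters=200):
    """Tabular MPI sketch (illustrative only).

    P[a][s][t]: probability of moving from state s to t under action a
                (assumed layout, not taken from the paper).
    R[a][s]:    expected reward for taking action a in state s (assumed).
    m:          number of applications of the greedy policy's Bellman
                operator per iteration.
    """
    n_actions, n_states = len(P), len(P[0])
    v = [0.0] * n_states
    policy = [0] * n_states

    def q(s, a, values):
        # One-step lookahead value of action a in state s.
        return R[a][s] + gamma * sum(
            P[a][s][t] * values[t] for t in range(n_states)
        )

    for _ in range(iters):
        # Greedy step: pick the best action under the current estimate v.
        policy = [
            max(range(n_actions), key=lambda a: q(s, a, v))
            for s in range(n_states)
        ]
        # Partial evaluation: apply the policy's Bellman operator m times.
        for _ in range(m):
            v = [q(s, policy[s], v) for s in range(n_states)]
    return v, policy


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
# Staying in state 1 pays reward 1; everything else pays 0.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
]
R = [
    [0.0, 1.0],  # action 0
    [0.0, 0.0],  # action 1
]

v_vi, pi_vi = modified_policy_iteration(P, R, gamma=0.9, m=1)   # value iteration
v_mpi, pi_mpi = modified_policy_iteration(P, R, gamma=0.9, m=5)  # MPI with m = 5
print(pi_vi, pi_mpi)
```

On this toy problem both settings of `m` reach the same greedy policy (switch out of state 0, stay in state 1), with values close to the analytic optimum of 9 and 10. In the approximate setting studied in the paper, both the greedy step and the evaluation step incur errors, which is where the choice of the parameter starts to matter.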

Approximate Modified Policy Iteration