In this paper, we consider the problem of planning and learning in infinite-horizon discounted-reward Markov decision problems. We propose a novel iterative direct policy-search approach, called dynamic policy programming (DPP). DPP is, to the best of our knowledge, the first convergent direct policy-search method that uses a Bellman-like iteration technique and is at the same time compatible with function approximation. For the tabular case, we prove that DPP converges asymptotically to the optimal policy. We numerically compare the performance of DPP with that of other state-of-the-art approximate dynamic programming methods on the mountain-car problem, using linear function approximation with Gaussian basis functions. We observe that, unlike the other approximate dynamic programming methods, DPP converges to a near-optimal policy even when the basis functions are randomly placed. We conclude that DPP, combined with function approximation, asymptotically outperforms the other approximate dynamic programming methods on the mountain-car problem.

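This paper proposes a new policy-iteration method, dynamic policy programming (DPP), for estimating the optimal policy in infinite-horizon Markov decision processes (MDPs). It proves finite-iteration and asymptotic l∞-norm performance-loss bounds for DPP in the presence of estimation and approximation errors. Numerical experiments show that, compared with existing reinforcement learning methods, DPP-based algorithms achieve better performance in all cases.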

Dynamic Policy Programming