Conservative Policy Iteration (CPI) is a founding algorithm of Approximate
Dynamic Programming (ADP). Its core principle is to stabilize greediness
through stochastic mixtures of consecutive policies. It comes with strong
theoretical guarantees and has inspired approaches in deep Reinforcement
Learning (RL). However, CPI itself has rarely been implemented, never with
neural networks, and has only been evaluated on toy problems. In this paper, we show how CPI
can be combined in practice with deep RL for discrete action spaces. We also
introduce adaptive mixture rates inspired by the theory. We thoroughly
evaluate the resulting algorithm on the simple Cartpole problem, and validate
the proposed method on a representative subset of Atari games. Overall, this
work suggests that revisiting classic ADP may lead to improved and more stable
deep RL algorithms.

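This paper studies the practical problem of applying the classic Conservative Policy Iteration algorithm to deep reinforcement learning and introduces adaptive mixture rates. Experiments on the Cartpole problem and on Atari games validate the effectiveness and stability of the algorithm, suggesting that revisiting classic Approximate Dynamic Programming may lead to improved and more stable deep reinforcement learning algorithms.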
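
To make the core principle concrete, the stochastic mixture of consecutive policies can be sketched for a single state with discrete actions. This is only an illustration of the standard CPI update, in which the new policy is (1 - alpha) times the current policy plus alpha times the greedy policy. It is tabular rather than neural, it uses a fixed mixture rate rather than the adaptive rates introduced in the paper, and the names `greedy_policy`, `cpi_mixture`, and `alpha` are hypothetical choices made for this example.

```python
def greedy_policy(q_row):
    """Return a one-hot distribution on the action with the highest Q-value."""
    best = max(range(len(q_row)), key=lambda a: q_row[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_row))]


def cpi_mixture(policy_row, q_row, alpha):
    """Mix the current policy with the greedy one for a single state.

    policy_row: current action probabilities for the state (sums to 1).
    q_row:      estimated Q-values for the same state (assumed given).
    alpha:      mixture rate in [0, 1]; fixed here, which is an assumption
                of this sketch and not the adaptive rate from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    greedy = greedy_policy(q_row)
    # Convex combination, so the result is still a probability distribution.
    return [(1.0 - alpha) * p + alpha * g for p, g in zip(policy_row, greedy)]


# Example: a uniform policy over two actions moves halfway toward greedy.
new_policy = cpi_mixture([0.5, 0.5], [1.0, 0.0], alpha=0.5)
```

With `alpha = 1` the update reduces to a fully greedy step, and with `alpha = 0` the policy is left unchanged; intermediate values give the conservative behavior the abstract refers to.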