Policy-based reinforcement learning algorithms are widely used in various fields. Among them, mainstream policy optimization algorithms such as PPO and TRPO introduce importance sampling into reinforcement learning, which allows the reuse of historical data. However, importance sampling also leads to high variance in the surrogate objective, which indirectly harms the stability and convergence of the algorithm. In this paper, we first derive an upper bound on the variance of the surrogate objective, which can grow quadratically as the surrogate objective increases. Next, we propose a dropout technique to avoid the excessive increase of the surrogate objective variance caused by importance sampling. Then, we introduce a general reinforcement learning framework applicable to mainstream policy optimization methods, and apply the dropout technique to the PPO algorithm to obtain the D-PPO variant. Finally, we conduct comparative experiments between D-PPO and PPO in the Atari 2600 environment. The results show that D-PPO achieves significant performance improvements over PPO and effectively limits the excessive increase of the surrogate objective variance during training.
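To make the variance issue concrete, the following minimal sketch computes the per-sample importance-weighted surrogate terms (importance ratio times advantage) and their empirical mean and variance. It is an illustration only, not the paper's method: the function names, the synthetic log-probabilities and advantages, and the `drift` parameter that simulates how far the new policy has moved from the old one are all assumptions made for this example, and the dropout technique itself is not implemented here.

```python
import numpy as np

def surrogate_terms(logp_new, logp_old, advantages):
    """Per-sample importance-weighted surrogate terms r_t * A_t."""
    # Importance ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratios = np.exp(logp_new - logp_old)
    return ratios * advantages

def surrogate_stats(logp_new, logp_old, advantages):
    """Empirical mean and variance of the surrogate terms."""
    terms = surrogate_terms(logp_new, logp_old, advantages)
    return terms.mean(), terms.var()

# Synthetic data (assumption): random old log-probabilities and advantages.
rng = np.random.default_rng(0)
n = 10_000
logp_old = rng.normal(-1.0, 0.3, size=n)
advantages = rng.normal(0.0, 1.0, size=n)

for drift in (0.0, 0.5, 1.0):
    # Hypothetical policy drift: the log-probability gap has std = drift.
    logp_new = logp_old + rng.normal(0.0, drift, size=n)
    mean, var = surrogate_stats(logp_new, logp_old, advantages)
    print(f"drift={drift:.1f}  mean={mean:+.3f}  variance={var:.3f}")
```

With zero drift the ratios are all 1, so the variance reduces to the variance of the advantages. As the new policy moves away from the old one, the ratios spread out and the empirical variance of the surrogate terms grows, which is the effect the abstract attributes to importance sampling.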

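This paper proposes a reinforcement learning framework applicable to mainstream policy optimization algorithms. By introducing a dropout technique, it avoids the excessive increase of the surrogate objective variance caused by importance sampling. Experiments in the Atari 2600 environment show that D-PPO achieves significant performance improvements over PPO and effectively limits the excessive increase of the surrogate objective variance during training.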

Dropout Strategy in Reinforcement Learning: Limiting the Surrogate Objective Variance in Policy Optimization Methods