Proximal Policy Optimization (PPO) is a popular model-free reinforcement
learning algorithm, valued for its simplicity and effectiveness. However,
because it is on-policy, it is limited in its ability to use data collected
by other policies. This paper introduces an off-policy extension of PPO,
called Transductive Off-policy PPO (ToPPO). We provide theoretical
justification for incorporating off-policy data in PPO training, together
with guidelines for using it safely. Our contribution includes a novel
formulation of the policy improvement lower bound for prospective policies
derived from off-policy data, accompanied by a computationally efficient
mechanism for optimizing this bound that comes with guarantees of monotonic
improvement. Experimental results on six representative tasks show
promising performance for ToPPO.

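This paper introduces Transductive Off-policy PPO (ToPPO), a new off-policy PPO method. It provides theoretical justification for incorporating off-policy data in PPO training and guidance for applying it safely, including a new formulation of the policy improvement lower bound for prospective policies derived from off-policy data and an efficient mechanism for optimizing that bound. Comprehensive experimental results demonstrate ToPPO's good performance.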