On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that amalgamates past and future state-action information for policy optimization. This approach empowers the agent for introspection, allowing modifications to its actions within the current state. Theoretical analysis confirms that policy performance is monotonically improved and contracts the solution space, consequently expediting the convergence procedure. Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks, culminating in superior sample efficiency. The source code of this work is available at https://github.com/Edgargan/RPO.

该论文介绍了一种新的基于策略的扩展方法——反思性策略优化（RPO），它将过去和未来的状态-动作信息结合起来以进行策略优化，从而使智能体能够自我审视并在当前状态下修改其动作。理论分析证实了政策绩效的递增和解集空间的收缩，从而加快了收敛过程。经验证据表明，在两个强化学习基准测试中，RPO在样本效率方面表现出了显著的优势。

反思式策略优化