Simple Policy Optimization

Jan, 2024

Zhengpeng Xie

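TL;DR: This paper introduces the SPO (Simple Policy Optimization) algorithm. By introducing a novel clipping method for the KL divergence, SPO can effectively enforce the trust region constraint in almost all environments while retaining the simplicity of a first-order algorithm. Comparative experiments in Atari 2600 environments show that SPO sometimes outperforms the PPO algorithm.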

Abstract

The PPO (Proximal Policy Optimization) algorithm has demonstrated excellent performance in many fields and is regarded as a simplified version of the TRPO (Trust Region Policy Optimization) algorithm. However, the ratio clipping operation in PPO may not always effectively enforce the trust region c