Provably Efficient Exploration in Policy Optimization

Dec 2019

Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang

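TL;DR: This paper proposes OPPO, an optimistic variant of the Proximal Policy Optimization algorithm. With linear function approximation, unknown transitions, and adversarial rewards, OPPO explores while achieving near-optimal regret, and it is the first policy optimization algorithm shown to do so.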

Abstract

While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a →