BriefGPT.xyz
Dec, 2019
Provably Efficient Exploration in Policy Optimization
Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang
TL;DR
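This paper proposes an optimistic variant of the Proximal Policy Optimization algorithm (OPPO) that, by incorporating exploration, attains a near-optimal solution under linear function approximation, unknown transitions, and adversarial rewards. It is the first algorithm to achieve this.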
Abstract
While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a …