June 2019
Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies
Kaiqing Zhang, Alec Koppel, Hao Zhu, Tamer Başar
TL;DR
Approaching the problem from a nonconvex optimization perspective, this work proposes a new variant of the PG method that estimates the policy gradient from rollouts of random, geometrically distributed horizon, yielding unbiased gradient estimates, and establishes convergence of the algorithm under a strict saddle point assumption. Finally, experiments demonstrate that redesigning the reward function can steer the iterates away from undesirable saddle points and toward better limit points.
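To make the TL;DR's central trick concrete, below is a minimal sketch of unbiased policy gradient estimation with geometrically distributed rollout horizons. It is a textbook construction in the spirit of the paper, not its exact algorithm; the Gymnasium-style environment interface (`env.reset`, `env.step`, `env.action_space.n`), the tabular softmax policy, and all function names are illustrative assumptions.

```python
# Sketch only: assumes a hypothetical Gymnasium-style env with integer states.
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy; theta is (n_states, n_actions)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s) with respect to theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[s] = -softmax_policy(theta, s)
    g[s, a] += 1.0
    return g

def random_horizon_pg_estimate(env, theta, gamma, rng):
    """One unbiased sample of the gradient of the discounted return.

    T1 is drawn with P(T1 = t) = (1 - gamma) * gamma**t, so scoring the
    action taken at time T1 (scaled by 1/(1 - gamma)) reproduces the
    discounted state weighting in expectation. Q(s_T1, a_T1) is then
    estimated by the *undiscounted* reward sum over T2 + 1 further steps,
    where P(T2 >= t) = gamma**t: the random truncation supplies the
    discount factors in expectation.
    """
    n_actions = env.action_space.n
    T1 = rng.geometric(1.0 - gamma) - 1  # support {0, 1, 2, ...}
    T2 = rng.geometric(1.0 - gamma) - 1

    s, _ = env.reset()
    for _ in range(T1):  # roll forward to the randomly chosen time T1
        a = rng.choice(n_actions, p=softmax_policy(theta, s))
        s, _, terminated, truncated, _ = env.step(a)
        if terminated or truncated:
            return np.zeros_like(theta)  # absorbing, zero-reward continuation

    a0 = rng.choice(n_actions, p=softmax_policy(theta, s))
    score = grad_log_pi(theta, s, a0)

    q_hat, a = 0.0, a0
    for _ in range(T2 + 1):  # undiscounted partial return from (s_T1, a_T1)
        s, r, terminated, truncated, _ = env.step(a)
        q_hat += r
        if terminated or truncated:
            break
        a = rng.choice(n_actions, p=softmax_policy(theta, s))

    return q_hat / (1.0 - gamma) * score
```

A stochastic ascent step is then simply `theta += alpha * random_horizon_pg_estimate(env, theta, gamma, rng)`; the paper's analysis additionally handles escaping strict saddle points, which this sketch does not address.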
Abstract
Policy gradient (PG) methods are a widely used reinforcement learning methodology in many applications such as video games, autonomous driving, and robotics. In spite of its empirical success, a rigorous understanding of the global convergence of PG methods is still lacking …