BriefGPT.xyz
Feb, 2021
On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method
Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, Mengdi Wang
TL;DR
This work proposes a simple yet effective gradient truncation mechanism to accelerate variance-reduction techniques for policy gradient algorithms. Building on it, the authors design a new method, TSIVR-PG, which can maximize not only the cumulative sum of rewards but also a general utility function of the policy's long-term visitation distribution, and they provide a theoretical analysis of TSIVR-PG.
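The core idea above can be sketched in a few lines: a recursive (SARAH/SVRG-style) variance-reduced gradient estimate combined with element-wise truncation before the policy update. This is an illustrative sketch, not the paper's exact TSIVR-PG algorithm; the recursion form, the clipping threshold `delta`, and the ascent step size are assumptions for demonstration only.

```python
import numpy as np

def truncated_vr_step(theta, grad_new, grad_old, v_prev, delta, lr):
    """One variance-reduced ascent step with gradient truncation.

    v = grad_new - grad_old + v_prev is a recursive variance-reduced
    gradient estimate; it is truncated element-wise to [-delta, delta],
    which keeps the update bounded even when a single stochastic
    gradient estimate is large.
    """
    v = grad_new - grad_old + v_prev
    v_truncated = np.clip(v, -delta, delta)  # gradient truncation
    theta_next = theta + lr * v_truncated    # gradient ascent (maximize reward)
    return theta_next, v_truncated
```

For example, a stochastic gradient of magnitude 10 per coordinate is truncated to the threshold `delta = 1` before the step, so each parameter coordinate moves by at most `lr * delta` per iteration.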
Abstract
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example the REINFORCE. Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $