BriefGPT.xyz
Jun, 2020
An operator view of policy gradient methods
Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux
TL;DR
By introducing the notion of operators, this paper recasts traditional policy gradient methods such as REINFORCE and PPO in operator form, giving a clearer view of how they work. It also introduces a new global lower bound that further bridges the gap between policy-based and value-based methods, showing that REINFORCE and the Bellman optimality operator can be seen as two sides of the same concept.
Abstract
We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies.
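As a rough illustration of the two-operator view, here is a minimal sketch on a toy bandit problem. The specific choices below are our own assumptions, not the paper's exact construction: the improvement operator $\mathcal{I}$ is taken to reweight the current policy by reward, and the projection operator $\mathcal{P}$ takes a gradient step fitting a softmax-parameterized policy to that improved target, which yields a REINFORCE-style update direction.

```python
import numpy as np

# Toy 3-armed bandit with fixed expected rewards (illustrative values).
rewards = np.array([1.0, 2.0, 3.0])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.zeros(3)  # parameters of the realizable (softmax) policy
lr = 0.5

for _ in range(200):
    pi = softmax(theta)
    # Improvement operator I (one simple choice): reweight pi by reward,
    # then renormalize, so better arms gain probability mass.
    improved = pi * rewards
    improved /= improved.sum()
    # Projection operator P: gradient step on the cross-entropy between the
    # improved target and pi_theta; for a softmax policy the gradient is
    # (pi - improved).
    theta -= lr * (pi - improved)

# Repeated I-then-P concentrates mass on the highest-reward arm.
print(softmax(theta))
```

Alternating the two operators drives the parametric policy toward the greedy policy on this bandit, mirroring how the paper interprets policy gradient methods as improvement followed by projection onto the realizable set.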