BriefGPT.xyz
Nov, 2016
PGQ: Combining policy gradient and Q-learning
Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, Volodymyr Mnih
TL;DR
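This paper proposes a new technique that combines policy gradient with Q-learning: it draws off-policy experience from a replay buffer, estimates Q-values from the policy's action preferences, and applies Q-learning updates to those estimates. In experiments, the resulting technique, PGQL, outperformed both asynchronous advantage actor-critic (A3C) and Q-learning across the full suite of Atari games, with improved data efficiency and stability.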
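The idea of reading Q-values off the policy's action preferences can be illustrated with a minimal sketch. It assumes an entropy-regularized relation of the form Q(s, a) = alpha * (log pi(a|s) + H(pi(.|s))) + V(s), which this summary does not spell out. The function names and the `alpha` and `gamma` parameters are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def log_softmax(prefs):
    """Log-probabilities of a softmax policy over action preferences."""
    m = max(prefs)
    log_total = math.log(sum(math.exp(p - m) for p in prefs))
    return [p - m - log_total for p in prefs]

def q_from_policy(prefs, value, alpha=0.1):
    """Q estimates from action preferences and a state value (assumed form).

    Q(s, a) = alpha * (log pi(a|s) + entropy) + V(s); alpha is an assumed
    entropy-regularization weight.
    """
    log_pi = log_softmax(prefs)
    entropy = -sum(math.exp(lp) * lp for lp in log_pi)
    return [alpha * (lp + entropy) + value for lp in log_pi]

def q_learning_td_error(prefs, value, action, reward,
                        next_prefs, next_value,
                        gamma=0.99, alpha=0.1, done=False):
    """One-step Q-learning error on a replayed transition.

    The target bootstraps from the max over next-state Q estimates, which
    are themselves derived from the policy.
    """
    q = q_from_policy(prefs, value, alpha)
    next_q = q_from_policy(next_prefs, next_value, alpha)
    target = reward + (0.0 if done else gamma * max(next_q))
    return target - q[action]
```

Under this assumed form, the policy-weighted average of the Q estimates equals V(s), so the value estimate and the action preferences together determine every Q-value. In a full agent, the squared TD error on replayed transitions would be minimized alongside the usual policy-gradient loss.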
Abstract
Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of …