BriefGPT.xyz
Sep, 2019
V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control
H. Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer...
TL;DR
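This paper studies V-MPO, a new reinforcement learning method that performs policy iteration based on a learned state-value function to improve performance. It achieves better results on multiple benchmark suites and also succeeds on problems with high-dimensional, continuous action spaces.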
Abstract
Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradie