BriefGPT.xyz
Nov, 2016
Improving Policy Gradient by Exploring Under-appreciated Rewards
Ofir Nachum, Mohammad Norouzi, Dale Schuurmans
TL;DR
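This paper proposes a novel model-free policy gradient algorithm for reinforcement learning that uses a probability-based, directed exploration strategy. It explores high-dimensional spaces with sparse rewards more effectively than existing entropy-regularization methods, and it is applied successfully to a range of algorithmic tasks.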
Abstract
This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Curre