BriefGPT.xyz
Jul, 2020
Policy Gradient for Task-Agnostic Exploration Based on Nonparametric State Entropy Estimation
A Policy Gradient Method for Task-Agnostic Exploration
Mirco Mutti, Lorenzo Pratissoli, Marcello Restelli
TL;DR
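This paper proposes a new policy search algorithm, MEPOL (Maximum Entropy POLicy optimization), and shows experimentally that it can learn maximum-entropy policies in high-dimensional, continuous-control domains. It thereby offers a viable choice of intrinsic objective for an agent learning an optimal exploration policy in a reward-free environment.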
Abstract
In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the