Task-Agnostic Exploration via Policy Gradient with Non-Parametric State Entropy Estimation

July 2020

A Policy Gradient Method for Task-Agnostic Exploration

Mirco Mutti, Lorenzo Pratissoli, Marcello Restelli

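TL;DR: This paper proposes MEPOL (Maximum Entropy POLicy optimization), a new policy search algorithm, and shows experimentally that it can learn maximum-entropy policies in high-dimensional, continuous-control domains. It thereby offers a viable choice for the intrinsic objective an agent can pursue when searching for an optimal policy in a reward-free environment.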

Abstract

In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the