Memory-Based Trajectory-Conditioned Policies for Learning from Sparse Rewards

Jul, 2019

Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy

Yijie Guo, Jongwook Choi, Marcin Moczulski, Samy Bengio, Mohammad Norouzi...

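TL;DR: This paper proposes a method for learning a trajectory-conditioned policy. By expanding on diverse past trajectories drawn from a memory buffer, the method helps the agent explore the state space more thoroughly and significantly improves performance on a range of complex tasks. (Without expert demonstrations or resetting to arbitrary states, it achieves state-of-the-art scores on the Atari games Montezuma's Revenge and Pitfall within five billion frames.)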
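To make the memory-buffer idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the class name `TrajectoryBuffer`, the choice to index trajectories by their final state, and the uniform sampling rule are all assumptions made for illustration. The trajectory-conditioned policy that would imitate a sampled trajectory is not shown.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Optional, Tuple

# One step of experience: (state key, action, reward). Hypothetical layout.
Transition = Tuple[Hashable, int, float]


@dataclass
class TrajectoryBuffer:
    """Keeps the highest-return trajectory seen for each distinct end state.

    Illustrative assumption: diversity is approximated by storing one
    trajectory per end state, so rarely reached states stay in the buffer.
    """

    entries: Dict[Hashable, Tuple[float, List[Transition]]] = field(
        default_factory=dict
    )

    def add(self, trajectory: List[Transition]) -> None:
        """Store the trajectory if it is the best known way to its end state."""
        if not trajectory:
            return
        end_state = trajectory[-1][0]
        total_return = sum(reward for _, _, reward in trajectory)
        best = self.entries.get(end_state)
        if best is None or total_return > best[0]:
            self.entries[end_state] = (total_return, list(trajectory))

    def sample(self, rng: random.Random) -> Optional[List[Transition]]:
        """Pick an end state uniformly and return its stored trajectory."""
        if not self.entries:
            return None
        end_state = rng.choice(list(self.entries))
        return self.entries[end_state][1]


buffer = TrajectoryBuffer()
buffer.add([("s0", 0, 0.0), ("s1", 1, 0.0)])
buffer.add([("s0", 1, 0.0), ("s2", 0, 1.0)])
# A higher-return route to "s1" replaces the earlier one.
buffer.add([("s0", 0, 0.0), ("s0", 1, 0.0), ("s1", 1, 0.5)])

# The sampled trajectory would serve as the demonstration the policy is
# conditioned on for the next episode.
demonstration = buffer.sample(random.Random(0))
```

Sampling uniformly over end states, rather than over returns, is one simple way to keep low-reward but novel trajectories in play. A real system would likely weight the choice, for example by visit counts.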

Abstract

This paper proposes a method for learning a trajectory-conditioned policy to imitate diverse demonstrations from the agent's own past experiences. We demonstrate that such self-imitation drives exploration in diverse directions and increases the chance of finding a globally optimal sol