Anirudh Goyal, Philemon Brakel, William Fedus, Timothy Lillicrap, Sergey Levine...
TL;DR: By using a backtracking model to trace backward from high-reward states, reinforcement learning can discover more high-reward states and thereby improve the efficiency of state sampling.
Abstract
In many environments, only a tiny subset of all states yields high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them.
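The core idea above can be sketched with a toy example. This is a minimal, hypothetical illustration (the function names and the 1-D chain environment are assumptions, not the paper's implementation): a backtracking model proposes likely predecessor states of a high-reward state, and chaining such steps yields a "recall trace" that ends in high reward and can be used as extra training data.

```python
# Hypothetical sketch: a backtracking model proposes a predecessor
# (state, action) pair for a given state; chaining these steps
# backward from a high-reward state builds a recall trace.

def backtrack_step(state):
    # Toy deterministic dynamics on a 1-D chain: state s is reached
    # from s - 1 via the action "right". A learned backtracking
    # model would sample predecessors here instead.
    return state - 1, "right"

def recall_trace(high_reward_state, length):
    """Walk backward from a high-reward state, then reverse the
    result so the trace reads forward in time."""
    trace = [(high_reward_state, None)]  # terminal state, no action yet
    state = high_reward_state
    for _ in range(length):
        prev_state, action = backtrack_step(state)
        trace.append((prev_state, action))
        state = prev_state
    trace.reverse()
    return trace

trace = recall_trace(high_reward_state=10, length=3)
# Each (s, a) pair means: taking action a in state s moves toward
# the high-reward state; the final entry is the high-reward state.
```

Training preferentially on such traces concentrates learning signal on the rare rewarding region of the state space, rather than on uninformative random exploration.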