Joshua Romoff, Peter Henderson, Ahmed Touati, Yann Ollivier, Emma Brunskill...
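TL;DR: This paper proposes TD(Delta), a value function approximation method for finite-horizon episodic reinforcement learning (RL) that splits the long-horizon value function into components in order to address shortcomings of standard TD learning.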
Abstract
In many finite horizon episodic reinforcement learning (RL) settings, it is desirable to optimize for the undiscounted return - in settings like Atari, for instance, the goal is to collect the most points while staying alive in the long run. Yet, it may be difficult (or even intractabl