We propose Episodic Backward Update (EBU), a new algorithm that boosts the performance of a deep reinforcement learning agent through fast reward propagation. In contrast to the conventional use of experience replay with uniform random sampling, our agent samples a whole episode and successively propagates the value of a state to its previous states. Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate efficiently through all transitions of a sampled episode. We evaluate our algorithm on the 2D MNIST Maze environment and on 49 games of the Atari 2600 environment, and show that our method improves sample efficiency at a competitive computational cost.

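We propose Episodic Backward Update (EBU), a novel deep reinforcement learning algorithm with direct value propagation. In contrast to the conventional use of experience replay with uniform random sampling, our agent samples a whole episode and successively propagates the value of a state to its previous states. Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate directly through all transitions of the sampled episode. We theoretically prove the convergence of the EBU method and evaluate it experimentally in both deterministic and stochastic environments. In particular, on 49 games of the Atari 2600 domain, EBU achieves the same mean and median human-normalized performance as DQN using only 5% and 10% of the samples, respectively.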
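To make the backward propagation idea concrete, the following is a minimal tabular sketch, not the deep-network algorithm evaluated in the paper. It assumes a small discrete state and action space, a Q-table stored as a NumPy array, and a learning rate of 1; the function name `backward_episode_update`, the episode tuple layout, and the toy chain environment are all hypothetical and chosen only for illustration.

```python
import numpy as np

def backward_episode_update(Q, episode, gamma=0.99, alpha=1.0):
    """Apply Q-learning updates to one episode, last transition first.

    Q       : array of shape (n_states, n_actions), modified in place
    episode : list of (state, action, reward, next_state, done) tuples,
              in the order they were experienced (assumed layout)
    """
    for state, action, reward, next_state, done in reversed(episode):
        # Bootstrap from the next state unless the episode ended there.
        # Because we walk backward, Q[next_state] was already refreshed.
        target = reward if done else reward + gamma * Q[next_state].max()
        Q[state, action] += alpha * (target - Q[state, action])
    return Q

# Hypothetical 4-state chain 0 -> 1 -> 2 -> 3 with a single action and
# a reward of 1 only on the final transition (a sparse, delayed reward).
Q = np.zeros((4, 1))
episode = [
    (0, 0, 0.0, 1, False),
    (1, 0, 0.0, 2, False),
    (2, 0, 1.0, 3, True),
]
backward_episode_update(Q, episode, gamma=0.9)
```

In this toy setting a single backward pass carries the terminal reward to every earlier state of the episode, whereas updating the same three transitions in forward order would only change the value of the last one.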

Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update