BriefGPT.xyz
Jun, 2018
RUDDER: Return Decomposition for Delayed Rewards
Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Sepp Hochreiter
TL;DR
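Proposes RUDDER, a method for Markov decision processes with delayed rewards. It redistributes rewards so that expected future rewards are pushed toward zero, which simplifies Q-value estimation. Experiments on artificial tasks validate the approach, and it shows a clear improvement on Atari games.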
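To make the idea of reward redistribution concrete, here is a toy sketch in Python. It is not the authors' implementation: the function name `redistribute` and the hand-written `predicted_return` values are assumptions for illustration, standing in for whatever model would predict the episode's return. The sketch credits each step with the change in predicted return, so the delayed reward moves to the steps that caused it while the total return stays the same.

```python
def redistribute(predictions, initial=0.0):
    """Turn per-step return predictions into per-step rewards.

    Each step receives the difference between the current and the
    previous prediction, so the rewards sum to the last prediction
    minus `initial`.
    """
    rewards = []
    previous = initial
    for prediction in predictions:
        rewards.append(prediction - previous)
        previous = prediction
    return rewards


# Original episode: the whole reward arrives at the final step.
original_rewards = [0.0, 0.0, 0.0, 0.0, 10.0]

# Hypothetical predictions of the final return after each step
# (made-up numbers; the last one equals the true return).
predicted_return = [0.0, 8.0, 8.0, 10.0, 10.0]

new_rewards = redistribute(predicted_return)

# The return of the episode is unchanged by the redistribution.
assert sum(new_rewards) == sum(original_rewards)
```

In this toy episode the reward is now paid at the second and fourth steps, where the prediction changes, and nothing remains to be paid at the end. That is the sense in which the reward still expected in the future shrinks once the relevant steps have been credited.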
Abstract
We propose a novel reinforcement learning approach for finite Markov decision processes (MDPs) with delayed rewards. In this work, biases
→