The success of popular algorithms for deep reinforcement learning, such as
policy gradients and Q-learning, relies heavily on the availability of an
informative reward signal at each timestep of the sequential decision-making
process. When rewards are only sparsely available during an