The remarkable success of reinforcement learning (RL) heavily relies on
observing the reward of every visited state-action pair. In many real world
applications, however, an agent can observe only a score that represents the
quality of the whole trajectory, which is referred to as the