Since the earliest days of reinforcement learning, the workhorse method for
assigning credit to actions over time has been temporal-difference (TD)
learning, which propagates credit backward one timestep at a time. This approach
suffers when delays between actions and rewards are long an