We propose and analyze an alternate approach to off-policy multi-step
temporal difference learning, in which off-policy returns are corrected with
the current Q-function in terms of rewards, rather than with the target policy
in terms of transition probabilities. We prove that such approximate
corrections are sufficient for off-policy convergence, both in policy
evaluation and control, under certain conditions.
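
As a sketch of the kind of reward-based correction described above (the
notation here is our own illustration and is not taken verbatim from the
paper: $x$ denotes states, $a$ actions, $\lambda$ the eligibility-trace
parameter, $\gamma$ the discount factor, and $\pi$ the target policy), the
multi-step return is built from expected TD errors under $\pi$ rather than
from importance-sampling ratios:
% Illustrative form of a Q-function-corrected multi-step return
% (assumed notation): each step's reward is corrected by an expected
% TD error under the target policy, with no likelihood ratios.
\begin{align*}
  G_t^{\lambda} &= Q(x_t, a_t) + \sum_{k \ge 0} (\gamma\lambda)^k \, \delta_{t+k}, \\
  \delta_s &= r_s + \gamma \, \mathbb{E}_{\pi} Q(x_{s+1}, \cdot) - Q(x_s, a_s),
  \qquad
  \mathbb{E}_{\pi} Q(x, \cdot) = \sum_a \pi(a \mid x) \, Q(x, a).
\end{align*}
Here the correction enters through the reward terms via the current
Q-function, so the return remains well defined even when the behavior
policy's action probabilities are unknown.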