This paper revisits the temporal difference (TD) learning algorithm for
policy evaluation tasks in reinforcement learning. Typically, the performance
of TD(0) and TD($\lambda$) is very sensitive to the choice of stepsize.
Oftentimes, TD(0) suffers from slow convergence. Motivated