Temporal difference (TD) learning is a popular algorithm for policy
evaluation in reinforcement learning, but vanilla TD can suffer substantially
from the inherent optimization variance. A variance reduced TD (VRTD)
algorithm was proposed by Korda and La (2015), which applies the variance
reduction technique directly to online TD learning with Markovian samples.