We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From our analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.
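To make the object of study concrete, the following is a minimal sketch of TD(0) with linear function approximation and tail-averaging, with an optional regularisation term. It is an illustration under stated assumptions, not the exact procedure analysed here: the function name `tail_averaged_td0`, its parameters, the form of the regularised update, the constant step size, and the two-state example are all hypothetical choices made for this sketch.

```python
import random


def tail_averaged_td0(sample_transition, features, dim, num_steps,
                      step_size, discount, tail_start, reg=0.0, seed=0):
    """Run TD(0) with linear features and return the tail-averaged iterate.

    The returned vector averages the iterates with index
    tail_start + 1, ..., num_steps. `reg` is an assumed ridge-style
    regularisation weight; reg=0.0 gives plain TD(0).
    """
    rng = random.Random(seed)
    theta = [0.0] * dim
    tail_sum = [0.0] * dim
    state = 0  # assumed initial state
    for t in range(1, num_steps + 1):
        next_state, reward = sample_transition(state, rng)
        phi = features(state)
        phi_next = features(next_state)
        value = sum(w * x for w, x in zip(theta, phi))
        value_next = sum(w * x for w, x in zip(theta, phi_next))
        td_error = reward + discount * value_next - value
        # Semi-gradient TD(0) step; the (1 - step_size * reg) factor is the
        # assumed regularisation, shrinking the iterate towards zero.
        theta = [(1.0 - step_size * reg) * w + step_size * td_error * x
                 for w, x in zip(theta, phi)]
        if t > tail_start:
            # Only iterates in the tail contribute to the average.
            tail_sum = [s + w for s, w in zip(tail_sum, theta)]
        state = next_state
    tail_length = num_steps - tail_start
    return [s / tail_length for s in tail_sum]


# Hypothetical two-state example: the next state is uniform over {0, 1},
# the reward is 1 in state 0 and 0 in state 1, and features are one-hot.
# With discount 0.5 the exact values are V(0) = 1.5 and V(1) = 0.5.
def sample_transition(state, rng):
    reward = 1.0 if state == 0 else 0.0
    return rng.randrange(2), reward


def one_hot(state):
    return [1.0, 0.0] if state == 0 else [0.0, 1.0]


estimate = tail_averaged_td0(sample_transition, one_hot, dim=2,
                             num_steps=20000, step_size=0.05,
                             discount=0.5, tail_start=10000)
print(estimate)
```

Discarding the first `tail_start` iterates means the average is not weighed down by early iterates that still carry the initial error, which is the intuition behind the faster decay of the bias term compared with averaging all iterates.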

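This work studies the finite-time behaviour of the temporal difference (TD) learning algorithm combined with tail-averaging. We find that tail-averaged TD converges at the optimal $O(1/t)$ rate, both in expectation and with high probability, without requiring information about the eigenvalues of the matrix underlying the projected TD fixed point. We also propose and analyse a variant of TD that incorporates regularisation, and conclude that regularised TD is useful for problems with ill-conditioned features.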

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation