We study the finite-time behaviour of the popular temporal difference (TD)
learning algorithm when combined with tail-averaging. We derive finite-time
bounds on the parameter error of the tail-averaged TD iterate under a step-size
choice that does not require information about the eigenvalues of the matrix
underlying the projected TD fixed point. Our analysis shows that tail-averaged
TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and
with high probability. In addition, our bounds exhibit a sharper rate of decay
for the initial error (bias), which is an improvement over averaging all
iterates. We also propose and analyse a variant of TD that incorporates
regularisation. From our analysis, we conclude that the regularised version of TD
is useful for problems with ill-conditioned features.
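
As a rough illustration of the two ideas named above, the following is a minimal sketch of TD(0) with linear function approximation that averages only the iterates from a chosen tail index onward and optionally adds a ridge-style regularisation term. The function name `tail_averaged_td`, the constant step size, the tail index, and the exact form of the regularised update are assumptions made for illustration; the abstract does not specify them, and this should not be read as the paper's algorithm.

```python
import numpy as np

def tail_averaged_td(P, r, Phi, gamma, step_size, num_iters, tail_start,
                     reg=0.0, seed=0):
    """Sketch of tail-averaged TD(0) with linear features (illustrative only).

    P: (n, n) transition matrix, r: (n,) rewards, Phi: (n, d) feature matrix.
    reg: hypothetical ridge-style regularisation weight (0 disables it).
    """
    rng = np.random.default_rng(seed)
    n_states, d = Phi.shape
    theta = np.zeros(d)
    tail_sum = np.zeros(d)
    s = rng.integers(n_states)
    for t in range(num_iters):
        # Sample the next state from the Markov chain.
        s_next = rng.choice(n_states, p=P[s])
        # TD error for the observed transition (s, r[s], s_next).
        td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        # Semi-gradient TD(0) step; the -reg * theta term is the assumed regulariser.
        theta = theta + step_size * (td_error * Phi[s] - reg * theta)
        # Accumulate only the iterates from tail_start onward.
        if t >= tail_start:
            tail_sum += theta
        s = s_next
    return tail_sum / (num_iters - tail_start)
```

Discarding the early iterates keeps the initial error out of the average, which is the intuition behind the faster decay of the bias term relative to averaging all iterates.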

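This work studies the finite-time behaviour of the temporal difference (TD) learning algorithm combined with tail-averaging. It shows that tail-averaged TD converges at the optimal $O(1/t)$ rate, both in expectation and with high probability, without requiring information about the eigenvalues of the matrix underlying the projected TD fixed point. It also proposes and analyses a variant of TD that incorporates regularisation, and concludes that regularised TD is useful for problems with ill-conditioned features.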

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

This work characterizes the benefits of averaging schemes widely used in
conjunction with stochastic gradient descent (SGD). In particular, this work
provides a sharp analysis of: (1) mini-batching, a method of averaging many
samples of a stochastic gradient both to reduce the variance of the stochastic
gradient estimate and to parallelize SGD, and (2) tail-averaging, a method
involving averaging the final few iterates of SGD to decrease the variance in
SGD's final iterate. This work presents non-asymptotic excess risk bounds for
these schemes for the stochastic approximation problem of least squares
regression.
Furthermore, this work establishes a precise problem-dependent extent to
which mini-batch SGD yields provable near-linear parallelization speedups over
SGD with batch size one. This allows for understanding learning rate versus
batch size tradeoffs for the final iterate of an SGD method. These results are
then utilized in providing a highly parallelizable SGD method that obtains the
minimax risk with nearly the same number of serial updates as batch gradient
descent, improving significantly over existing SGD methods. A non-asymptotic
analysis of communication efficient parallelization schemes such as
model-averaging/parameter mixing methods is then provided.
Finally, this work sheds light on some fundamental differences in SGD's
behavior when dealing with agnostic noise in the (non-realizable) least squares
regression problem. In particular, the work shows that the stepsizes that
ensure minimax risk for the agnostic case must be a function of the noise
properties.
This paper builds on the operator view of analyzing SGD methods, introduced
by Defossez and Bach (2015), followed by developing a novel analysis in
bounding these operators to characterize the excess risk. These techniques are
of broader interest in analyzing computational aspects of stochastic
approximation.
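
For concreteness, the two averaging schemes discussed above can be combined in a few lines for least squares regression. The sketch below is illustrative only: the function name `minibatch_tail_averaged_sgd`, the constant step size, sampling with replacement, and the choice of tail index are assumptions rather than the settings analysed in the paper.

```python
import numpy as np

def minibatch_tail_averaged_sgd(X, y, step_size, batch_size, num_iters,
                                tail_start, seed=0):
    """Sketch of mini-batch SGD with tail-averaging for least squares."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    tail_sum = np.zeros(d)
    for t in range(num_iters):
        # Mini-batching: average the stochastic gradient over a sampled batch.
        idx = rng.integers(n, size=batch_size)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch_size
        w = w - step_size * grad
        # Tail-averaging: accumulate only the iterates from tail_start onward.
        if t >= tail_start:
            tail_sum += w
    return tail_sum / (num_iters - tail_start)
```

Each gradient in a batch can be computed independently, which is where the parallelization opportunity comes from; the tail average then reduces the variance of the returned estimate relative to the final iterate alone.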

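This study examines the benefits of averaging schemes widely used with stochastic gradient descent (SGD). In particular, through a non-asymptotic excess risk analysis of the stochastic approximation problem of least squares regression, it provides performance guarantees for these schemes and proposes a highly parallelizable SGD method. It also shows that, in the agnostic noise setting, the step sizes that ensure minimax risk must be a function of the noise properties.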