Target networks have been a popular and key component of recent deep
Q-learning algorithms for reinforcement learning, yet little is known about
them from the theory side. In this work, we introduce a new family of
target-based temporal difference (TD) learning algorithms and provide a
theoretical analysis of their convergence. In contrast to standard TD-learning,
target-based TD algorithms maintain two separate learning parameters: the
target variable and the online variable. In particular, we introduce three
members of the family, called averaging TD, double TD, and periodic TD, where
the target variable is updated in an averaging, symmetric, or periodic fashion,
mirroring the techniques used in deep Q-learning practice.
We establish asymptotic convergence analyses for both averaging TD and double
TD and a finite sample analysis for periodic TD. We also provide simulation
results showing potentially superior convergence of these target-based TD
algorithms compared to standard TD-learning. While this work focuses on linear
function approximation and the policy evaluation setting, we consider it a
meaningful step towards the theoretical understanding of deep Q-learning
variants with target networks.
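
To make the two-variable structure concrete, here is a minimal sketch of a periodic target update in plain Python. It is an illustration under assumptions, not the paper's algorithm: the abstract does not state the update rules, so the inner stochastic-gradient step, the hard copy of the online variable into the target, the three-state chain (`P`, `R`), the one-hot `FEATURES`, and all step counts are hypothetical choices. One-hot features are a special case of linear function approximation, chosen so the estimate can be compared against exact values.

```python
import random

# Hypothetical 3-state Markov reward process (an assumption, not from the paper).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]           # transition probabilities under the fixed policy
R = [0.0, 1.0, 2.0]             # reward collected when leaving each state
FEATURES = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]    # one-hot features (tabular special case)
GAMMA = 0.9                     # discount factor


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def periodic_td(P, R, features, gamma, n_outer=100, n_inner=400,
                step=0.05, seed=0):
    """Sketch of a periodic target-based TD update (assumed form).

    The target variable is frozen for n_inner steps while the online
    variable takes gradient-style steps toward the backup computed with
    the frozen target; the target is then overwritten by the online variable.
    """
    rng = random.Random(seed)
    states = range(len(P))
    dim = len(features[0])
    target = [0.0] * dim
    online = [0.0] * dim
    s = 0
    for _ in range(n_outer):
        for _ in range(n_inner):
            # Sample the next state along a single Markov trajectory.
            s_next = rng.choices(states, weights=P[s])[0]
            # The bootstrapped backup uses the frozen target variable.
            backup = R[s] + gamma * dot(features[s_next], target)
            delta = backup - dot(features[s], online)
            online = [w + step * delta * f
                      for w, f in zip(online, features[s])]
            s = s_next
        # Periodic synchronization: copy online into target.
        target = list(online)
    return target


def exact_values(P, R, gamma, sweeps=2000):
    """Exact state values, obtained by iterating the Bellman equation."""
    v = [0.0] * len(P)
    for _ in range(sweeps):
        v = [R[s] + gamma * dot(P[s], v) for s in range(len(P))]
    return v


if __name__ == "__main__":
    print("periodic TD estimate:", periodic_td(P, R, FEATURES, GAMMA))
    print("exact values:        ", exact_values(P, R, GAMMA))
```

For this chain the exact values are approximately (8.18, 10, 11.82). The periodic TD estimate is random and should land near them without matching exactly, because the constant step size leaves residual noise. Averaging TD and double TD would replace the hard copy with a different target update and are not sketched here.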

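This paper introduces a new family of target-based temporal difference (TD) learning algorithms and gives a theoretical analysis of their convergence. Unlike standard TD learning, these algorithms maintain two separate learning parameters, a target variable and an online variable, mirroring the target-network techniques used in deep Q-learning practice.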

Target-Based Temporal Difference Learning

We consider the dynamics of a linear stochastic approximation algorithm
driven by Markovian noise, and derive finite-time bounds on the moments of the
error, i.e., deviation of the output of the algorithm from the equilibrium
point of an associated ordinary differential equation (ODE). We obtain
finite-time bounds on the mean-square error in the case of constant step-size
algorithms by considering the drift of an appropriately chosen Lyapunov
function. The Lyapunov function can be interpreted either in terms of Stein's
method to obtain bounds on steady-state performance or in terms of Lyapunov
stability theory for linear ODEs. We also provide a comprehensive treatment of
the moments of the square of the 2-norm of the approximation error. Our
analysis yields the following results: (i) for a given step-size, we show that
the lower-order moments can be made small as a function of the step-size and
can be upper-bounded by the moments of a Gaussian random variable; (ii) we show
that the higher-order moments beyond a threshold may be infinite in
steady-state; and (iii) we characterize the number of samples needed for the
finite-time bounds to be of the same order as the steady-state bounds. As a
by-product of our analysis, we also solve the open problem of obtaining
finite-time bounds for the performance of temporal difference learning
algorithms with linear function approximation and a constant step-size, without
requiring a projection step or an i.i.d. noise assumption.
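
As a rough illustration of the setting in the last sentence, the following sketch runs TD(0) with linear features and a constant step size along a single Markov trajectory, with no projection step, and records the squared 2-norm of the error at every iteration. The chain (`P`, `R`), the one-hot `FEATURES`, the step sizes, and the run lengths are all hypothetical choices made for this example. With one-hot features the equilibrium point coincides with the exact value function, which is computed here by iterating the Bellman equation. The sketch only simulates the recursion; it does not reproduce the Lyapunov-drift analysis or the bounds.

```python
import random

# Hypothetical 3-state Markov reward process (an assumption, not from the paper).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]           # transition probabilities
R = [0.0, 1.0, 2.0]             # reward collected when leaving each state
FEATURES = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]    # one-hot features (tabular special case)
GAMMA = 0.9                     # discount factor


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def exact_values(P, R, gamma, sweeps=2000):
    """Equilibrium point for one-hot features: the exact state values."""
    v = [0.0] * len(P)
    for _ in range(sweeps):
        v = [R[s] + gamma * dot(P[s], v) for s in range(len(P))]
    return v


def td0_constant_step(P, R, features, gamma, theta_star, step, n_steps,
                      seed=0):
    """TD(0) with a constant step size on one Markov trajectory.

    No projection is applied and consecutive samples are correlated.
    Returns the final parameter and the squared 2-norm error per step.
    """
    rng = random.Random(seed)
    states = range(len(P))
    theta = [0.0] * len(features[0])
    sq_errors = []
    s = 0
    for _ in range(n_steps):
        s_next = rng.choices(states, weights=P[s])[0]
        delta = (R[s] + gamma * dot(features[s_next], theta)
                 - dot(features[s], theta))
        theta = [w + step * delta * f for w, f in zip(theta, features[s])]
        sq_errors.append(sum((a - b) ** 2
                             for a, b in zip(theta, theta_star)))
        s = s_next
    return theta, sq_errors


def tail_mean(xs, fraction=0.5):
    """Average of the last `fraction` of a sequence (steady-state proxy)."""
    tail = xs[int(len(xs) * (1.0 - fraction)):]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    theta_star = exact_values(P, R, GAMMA)
    for step in (0.05, 0.005):
        _, errs = td0_constant_step(P, R, FEATURES, GAMMA, theta_star,
                                    step=step, n_steps=100000)
        print("step", step, "tail mean-square error", tail_mean(errs))
```

Running the script with the two step sizes gives an empirical feel for result (i): the tail mean-square error is expected to be smaller for the smaller step size, at the cost of a longer transient. The exact numbers depend on the seed and on the assumed chain.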

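This paper considers the dynamics of a linear stochastic approximation algorithm driven by Markovian noise and, by analyzing the drift of an appropriately chosen Lyapunov function, obtains finite-time bounds on the mean-square error for constant step-size algorithms. It also gives a comprehensive treatment of the moments of the square of the 2-norm of the approximation error.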