Integral to recent successes in deep reinforcement learning has been a class
of temporal difference methods that use infrequently updated target values for
policy evaluation in a Markov Decision Process. Yet a complete theoretical
explanation for the effectiveness of target networks remains elusive. In this
work, we provide an analysis of this popular class of algorithms, to finally
answer the question: "why do target networks stabilise TD learning?" To do so,
we formalise the notion of a partially fitted policy evaluation method, which
describes the use of target networks and bridges the gap between fitted methods
and semigradient temporal difference algorithms. Using this framework we are
able to uniquely characterise the so-called deadly triad - the use of TD
updates with (nonlinear) function approximation and off-policy data - which
often leads to nonconvergent algorithms. This insight leads us to conclude that
the use of target networks can mitigate the effects of poor conditioning in the
Jacobian of the TD update. In particular, we show that under mild regularity
conditions and a well-tuned target network update frequency, convergence can be
guaranteed even in the extremely challenging off-policy sampling and nonlinear
function approximation setting.

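This study provides a theoretical explanation of target networks in deep reinforcement learning. By formalising partially fitted policy evaluation methods, it explains why target networks can stabilise TD learning, and it states the conditions under which convergence is guaranteed in the extremely challenging setting of off-policy sampling and nonlinear function approximation.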
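
To make the idea of infrequently updated target values concrete, here is a minimal sketch of TD policy evaluation with a target parameter that is refreshed every `k` steps. It is an illustration under stated assumptions, not the paper's algorithm: it uses linear (one-hot) features rather than a nonlinear network, on-policy data from a hypothetical two-state cycle, and invented names (`partially_fitted_td`, `k`, `n_target_updates`). Setting `k=1` recovers an ordinary semigradient TD update, while a large `k` behaves more like a fitted method.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def partially_fitted_td(features, transitions, gamma, alpha, k, n_target_updates):
    """Linear TD(0) policy evaluation with a target parameter refreshed every k steps.

    features:    dict mapping state -> feature vector (list of floats)
    transitions: list of (state, reward, next_state) samples, replayed in order
    """
    dim = len(next(iter(features.values())))
    online = [0.0] * dim      # parameter updated at every step
    target = list(online)     # frozen copy used only for bootstrapping
    step = 0
    for _ in range(n_target_updates):
        for _ in range(k):
            s, r, s_next = transitions[step % len(transitions)]
            step += 1
            phi, phi_next = features[s], features[s_next]
            # Bootstrap from the frozen target parameter, not the online one.
            td_error = r + gamma * dot(target, phi_next) - dot(online, phi)
            online = [w + alpha * td_error * f for w, f in zip(online, phi)]
        # Infrequent target update: copy the online parameter across.
        target = list(online)
    return online


# Hypothetical two-state cycle: 0 -> 1 with reward 1, then 1 -> 0 with reward 0.
# With gamma = 0.5 the true values solve v0 = 1 + 0.5*v1 and v1 = 0.5*v0,
# which gives v0 = 4/3 and v1 = 2/3.
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
transitions = [(0, 1.0, 1), (1, 0.0, 0)]

values_k10 = partially_fitted_td(features, transitions, gamma=0.5, alpha=0.5,
                                 k=10, n_target_updates=50)
values_k1 = partially_fitted_td(features, transitions, gamma=0.5, alpha=0.5,
                                k=1, n_target_updates=500)
print(values_k10, values_k1)
```

In this benign on-policy example both settings of `k` approach the true values; the abstract's claims concern the much harder off-policy, nonlinear case, which this sketch does not reproduce.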

How Target Networks Stabilise Temporal Difference Methods

Why Target Networks Stabilise Temporal Difference Methods

The $Q$-learning algorithm is a simple and widely-used stochastic
approximation scheme for reinforcement learning, but the basic protocol can
exhibit instability in conjunction with function approximation. Such
instability can be observed even with linear function approximation. In
practice, tools such as target networks and experience replay appear to be
essential, but the individual contribution of each of these mechanisms is not
well understood theoretically. This work proposes an exploration variant of the
basic $Q$-learning protocol with linear function approximation. Our modular
analysis illustrates the role played by each algorithmic tool that we adopt: a
second order update rule, a set of target networks, and a mechanism akin to
experience replay. Together, they enable state-of-the-art regret bounds on
linear MDPs while preserving the most prominent feature of the algorithm,
namely a space complexity independent of the number of steps elapsed. We show
that the performance of the algorithm degrades very gracefully under a novel
and more permissive notion of approximation error. The algorithm also exhibits
a form of instance-dependence, in that its performance depends on the
"effective" feature dimension.

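This paper discusses the instability of the $Q$-learning algorithm and proposes an exploration-based variant. By combining a second-order update rule, target networks, and a mechanism akin to experience replay, the algorithm achieves state-of-the-art regret bounds on linear MDPs with a space complexity independent of the number of steps elapsed. The algorithm also exhibits a form of instance-dependence, and its performance degrades gracefully under a more permissive notion of approximation error.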
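
The remark that instability can appear even with linear function approximation can be illustrated with a well-known two-state textbook construction. The sketch below is not taken from the paper and uses a semigradient TD update rather than full $Q$-learning: the assumptions are a single weight `w`, a state A with feature 1, a state B with feature 2, zero reward, and an off-policy data stream that only ever contains the transition A -> B. The function name and parameters are hypothetical.

```python
def semi_gradient_updates(w, gamma, alpha, n_steps):
    """Repeat the semigradient TD update on the single transition A -> B.

    The value estimates are v(A) = 1 * w and v(B) = 2 * w, and the reward is 0.
    Each update multiplies w by (1 + alpha * (2 * gamma - 1)), so the weight
    grows without bound whenever gamma > 0.5.
    """
    history = [w]
    for _ in range(n_steps):
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        w = w + alpha * td_error * 1.0  # the gradient of v(A) w.r.t. w is 1
        history.append(w)
    return history


diverging = semi_gradient_updates(w=1.0, gamma=0.9, alpha=0.1, n_steps=100)
shrinking = semi_gradient_updates(w=1.0, gamma=0.4, alpha=0.1, n_steps=100)
print(diverging[-1], shrinking[-1])
```

With `gamma=0.9` the weight grows geometrically, while with `gamma=0.4` it decays towards zero, so the same update rule is stable or unstable depending only on the discount factor and the sampling distribution.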

Stabilising Q-learning with Linear Architectures for Provably Efficient Learning

Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning

The use of target networks has been a popular and key component of recent
deep Q-learning algorithms for reinforcement learning, yet little is known from
the theory side. In this work, we introduce a new family of target-based
temporal difference (TD) learning algorithms and provide a theoretical analysis
of their convergence. In contrast to standard TD learning, target-based TD
algorithms maintain two separate learning parameters: the target variable and
the online variable. In particular, we introduce three members of the family,
called averaging TD, double TD, and periodic TD, where the target variable is
updated in an averaging, symmetric, or periodic fashion, mirroring the
techniques used in deep Q-learning practice.
We establish asymptotic convergence analyses for both averaging TD and double
TD and a finite-sample analysis for periodic TD. In addition, we provide
simulation results showing potentially superior convergence of these
target-based TD algorithms compared to standard TD learning. While this
work focuses on linear function approximation and the policy evaluation setting, we
consider this as a meaningful step towards the theoretical understanding of
deep Q-learning variants with target networks.

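This paper introduces a new family of target-based temporal difference (TD) learning algorithms and analyses their convergence theoretically. Unlike standard TD learning, these algorithms maintain two separate learning parameters, the target variable and the online variable, mirroring the target-network techniques used in deep Q-learning practice.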
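
The two-parameter structure described above can be sketched for the averaging case. The code below is only in the spirit of averaging TD: the exact update rules and step-size conditions in the paper may differ, and the soft-update rate `tau`, the function name, and the two-state example are assumptions made for illustration. The online variable takes a TD step that bootstraps from the target variable, and the target variable then moves a small fraction of the way towards the online variable.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def averaging_target_td(features, transitions, gamma, alpha, tau, n_steps):
    """Linear TD(0) with an online variable and a slowly averaged target variable."""
    dim = len(next(iter(features.values())))
    online = [0.0] * dim
    target = [0.0] * dim
    for step in range(n_steps):
        s, r, s_next = transitions[step % len(transitions)]
        phi, phi_next = features[s], features[s_next]
        # The online variable bootstraps from the target variable.
        td_error = r + gamma * dot(target, phi_next) - dot(online, phi)
        online = [w + alpha * td_error * f for w, f in zip(online, phi)]
        # Averaging-style target update: move a fraction tau towards online.
        target = [t + tau * (w - t) for t, w in zip(target, online)]
    return online, target


# Hypothetical two-state cycle with one-hot (tabular) features:
# 0 -> 1 with reward 1, then 1 -> 0 with reward 0; gamma = 0.5.
# The true values are v0 = 4/3 and v1 = 2/3.
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
transitions = [(0, 1.0, 1), (1, 0.0, 0)]

online, target = averaging_target_td(features, transitions, gamma=0.5,
                                     alpha=0.5, tau=0.1, n_steps=5000)
print(online, target)
```

A periodic variant would replace the averaging line with a hard copy `target = list(online)` performed once every fixed number of steps.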