This paper analyzes multi-step TD-learning algorithms within the "deadly
triad" scenario, characterized by linear function approximation, off-policy
learning, and bootstrapping. In particular, we prove that n-step TD-learning
algorithms converge to a solution when the sampling horizon n is sufficiently
large. The paper is divided into two parts. In the first part, we
comprehensively examine the fundamental properties of their model-based
deterministic counterparts, including projected value iteration, gradient
descent algorithms, and the control-theoretic approach. These can be viewed as
prototype deterministic algorithms whose analysis plays a pivotal role in
understanding and developing their model-free reinforcement learning
counterparts. We prove that these deterministic algorithms converge to
meaningful solutions when n is sufficiently large. In the second part, based on
these findings, two n-step TD-learning algorithms are proposed and analyzed,
which can be seen as the model-free reinforcement learning counterparts of the
gradient and control-theoretic algorithms.

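This paper analyzes multi-step TD-learning algorithms in the "deadly triad" setting of linear function approximation, off-policy learning, and bootstrapping, and proves that n-step TD-learning algorithms converge to a solution when the sampling horizon n is sufficiently large. Based on these findings, two n-step TD-learning algorithms are proposed and analyzed; they can be viewed as the model-free reinforcement learning counterparts of the gradient and control-theoretic algorithms.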
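
To illustrate why a large horizon n helps, the following is a minimal sketch of model-based n-step projected value iteration on a toy problem. It is an illustrative assumption, not the paper's exact algorithm or notation: the names `n_step_operator`, `projection_coeffs`, `contraction_bound`, and `n_step_projected_vi`, the 3-state transition matrix `P`, rewards `R`, features `Phi`, and weighting distribution `d` are all hypothetical. The sketch uses the standard sup-norm argument that the projected n-step Bellman operator is a contraction whenever gamma^n times the induced infinity norm of the projection is below one, which is a sufficient condition only.

```python
import numpy as np

def n_step_operator(P, R, gamma, n):
    """Return (R_n, P_n) such that the n-step Bellman operator is
    T^n V = R_n + P_n @ V, with R_n = sum_{i<n} (gamma P)^i R and
    P_n = (gamma P)^n."""
    num_states = len(R)
    disc_power = np.eye(num_states)  # holds (gamma * P)^i
    R_n = np.zeros(num_states)
    for _ in range(n):
        R_n += disc_power @ R
        disc_power = gamma * P @ disc_power
    return R_n, disc_power

def projection_coeffs(Phi, d):
    """Weighted least-squares map V -> argmin_theta ||Phi theta - V||_D."""
    D = np.diag(d)
    return np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

def contraction_bound(Phi, d, gamma, n):
    """Sufficient (not necessary) sup-norm contraction factor of Pi T^n."""
    Pi = Phi @ projection_coeffs(Phi, d)  # projection onto span(Phi)
    return gamma**n * np.linalg.norm(Pi, ord=np.inf)

def n_step_projected_vi(P, R, Phi, d, gamma, n, iters=500):
    """Iterate Phi theta_{k+1} = Pi T^n (Phi theta_k) from theta_0 = 0."""
    R_n, P_n = n_step_operator(P, R, gamma, n)
    proj = projection_coeffs(Phi, d)
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        theta = proj @ (R_n + P_n @ (Phi @ theta))
    return theta

# Hypothetical toy problem: 3 states, 2 features.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])          # target-policy transition matrix
R = np.array([1.0, 0.0, -1.0])           # expected one-step rewards
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])             # feature matrix
d = np.ones(3) / 3                       # state weighting (assumed)

theta = n_step_projected_vi(P, R, Phi, d, gamma=0.9, n=10)
print("bound for n=1 :", contraction_bound(Phi, d, 0.9, 1))
print("bound for n=10:", contraction_bound(Phi, d, 0.9, 10))
print("theta:", theta)
```

In this toy setting the bound exceeds one for n = 1 and drops below one for n = 10, so the iteration is guaranteed to converge only in the latter case. A model-free algorithm would replace the exact matrices with sampled n-step returns.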