We consider off-policy temporal-difference (TD) learning in discounted Markov
decision processes, where the goal is to evaluate a policy in a model-free way
by using observations of a state process generated without executing the
policy. To curb the high variance of off-policy TD learning, we propose a
new scheme of setting the $\lambda$-parameters of TD, based on generalized
Bellman equations. Our scheme is to set $\lambda$ according to the eligibility
trace iterates calculated in TD, thereby easily keeping these traces in a
desired bounded range. Compared with prior work, this scheme is more direct and
flexible, and allows much larger $\lambda$ values for off-policy TD learning
with bounded traces. As to its soundness, using Markov chain theory, we prove
the ergodicity of the joint state-trace process under nonrestrictive
conditions, and we show that associated with our scheme is a generalized
Bellman equation (for the policy to be evaluated) that depends on both the
evolution of $\lambda$ and the unique invariant probability measure of the
state-trace process. These results not only lead immediately to a
characterization of the convergence behavior of least-squares-based
implementations of our scheme, but also prepare the ground for further analysis
of gradient-based implementations.
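
To make the idea of choosing $\lambda$ from the trace iterates concrete, the following is a minimal sketch and not necessarily the exact rule analyzed in the paper. It assumes the standard off-policy trace recursion $e_t = \lambda_t \gamma \rho_{t-1} e_{t-1} + \phi(S_t)$ and one illustrative choice of $\lambda_t$: keep $\lambda_t = 1$ while the decayed trace stays within a Euclidean-norm bound, and shrink it otherwise. The names `bounded_trace_update`, `simulate`, and `bound`, as well as the feature vectors and importance-sampling ratios, are hypothetical.

```python
import math
import random


def norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def bounded_trace_update(trace, features, rho_prev, gamma, bound):
    """One step of e_t = lambda_t * gamma * rho_{t-1} * e_{t-1} + phi(S_t).

    lambda_t is computed from the previous trace so that the decayed part
    lambda_t * gamma * rho_{t-1} * e_{t-1} has norm at most `bound`.
    This particular rule is an illustrative assumption.
    """
    scale = gamma * rho_prev * norm(trace)
    lam = 1.0 if scale <= bound else bound / scale
    new_trace = [lam * gamma * rho_prev * e + f for e, f in zip(trace, features)]
    return new_trace, lam


def simulate(num_steps=1000, seed=0, gamma=0.95, bound=5.0):
    """Run the trace recursion on synthetic data and return the largest norm seen."""
    rng = random.Random(seed)
    trace = [0.0, 0.0, 0.0]
    max_norm = 0.0
    for _ in range(num_steps):
        rho = rng.choice([0.0, 0.5, 3.0])  # hypothetical importance-sampling ratios
        state = rng.randrange(3)
        features = [1.0 if i == state else 0.0 for i in range(3)]  # one-hot phi(S_t)
        trace, _ = bounded_trace_update(trace, features, rho, gamma, bound)
        max_norm = max(max_norm, norm(trace))
    return max_norm
```

With unit-norm features, every trace produced by this rule has norm at most `bound + 1`, whatever the sequence of importance-sampling ratios. In this sketch $\lambda_t$ stays equal to 1 whenever the decayed trace is already within the bound, so it is reduced only on the steps where the trace would otherwise grow too large.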

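This paper studies off-policy temporal-difference learning in discounted Markov decision processes. It proposes a new scheme for setting the $\lambda$-parameters, based on generalized Bellman equations, to control variance. Using Markov chain theory, it proves the ergodicity of the joint state-trace process and characterizes the convergence behavior of least-squares-based implementations of the scheme.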