We consider off-policy temporal-difference (TD) learning methods for policy
evaluation in Markov decision processes with finite spaces and discounted
reward criteria, and we present a collection of convergence results for several
gradient-based TD algorithms with linear function approximation. The algorithms
we analyze include: (i) two basic forms of two-time-scale gradient-based TD
algorithms, which we call GTD and which minimize the mean squared projected
Bellman error using stochastic gradient descent; (ii) their "robustified"
biased variants; (iii) their mirror-descent versions, which combine the
mirror-descent idea with TD learning; and (iv) a single-time-scale version of
GTD that solves minimax problems formulated for approximate policy evaluation.
We derive convergence results for three types of stepsizes: constant
stepsizes, slowly diminishing stepsizes, and the standard type of diminishing
stepsizes that satisfy a square-summability condition. For the first two types
of stepsizes, we apply the weak convergence method from stochastic
approximation theory to characterize the asymptotic behavior of the algorithms,
and for the standard type of stepsize, we analyze the algorithmic behavior with
respect to a stronger mode of convergence, almost sure convergence. Our
convergence results are for the aforementioned TD algorithms with three general
ways of setting their $\lambda$-parameters: (i) state-dependent $\lambda$; (ii)
a recently proposed scheme of using history-dependent $\lambda$ to keep the
eligibility traces of the algorithms bounded while allowing for relatively
large values of $\lambda$; and (iii) a composite scheme of setting the
$\lambda$-parameters that combines the preceding two schemes and allows a
broader class of generalized Bellman operators to be used for approximate
policy evaluation with TD methods.
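
As a rough illustration of item (i), the following minimal Python sketch performs one update of a GTD2-style two-time-scale algorithm with linear function approximation at $\lambda = 0$. It is one commonly used importance-weighted form and not necessarily the exact variant analyzed here, which uses eligibility traces; the function name `gtd2_step` and all parameter names are assumptions made for this example.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def gtd2_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One GTD2-style update with linear features (illustrative sketch).

    theta    : value-function weights (slow time scale, stepsize alpha)
    w        : auxiliary weights (fast time scale, stepsize beta)
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio pi(a|s) / mu(a|s)
    Returns the new (theta, w) and the TD error delta.
    """
    # TD error under the current linear value estimate.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)
    # Slow iterate: stochastic gradient step on the projected Bellman error.
    new_theta = [
        t + alpha * rho * (p - gamma * pn) * phi_w
        for t, p, pn in zip(theta, phi, phi_next)
    ]
    # Fast iterate: tracks a least-squares fit of the weighted TD error.
    new_w = [wi + beta * (rho * delta - phi_w) * p for wi, p in zip(w, phi)]
    return new_theta, new_w, delta


if __name__ == "__main__":
    theta, w, delta = gtd2_step(
        theta=[0.0, 0.0], w=[1.0, 0.0],
        phi=[1.0, 0.0], phi_next=[0.0, 1.0],
        reward=1.0, rho=2.0, gamma=0.5, alpha=0.1, beta=0.5,
    )
    print(theta, w, delta)
```

In this sketch the auxiliary vector `w` uses the larger stepsize `beta` and `theta` the smaller stepsize `alpha`, which is the usual arrangement for a two-time-scale scheme.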

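This paper considers off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite state spaces and discounted reward criteria, and presents a set of convergence results for several gradient-based TD algorithms.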

On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

We consider off-policy temporal-difference (TD) learning in discounted Markov
decision processes, where the goal is to evaluate a policy in a model-free way
by using observations of a state process generated without executing the
policy. To curb the high variance of off-policy TD learning, we propose a
new scheme of setting the $\lambda$-parameters of TD, based on generalized
Bellman equations. Our scheme is to set $\lambda$ according to the eligibility
trace iterates calculated in TD, thereby easily keeping these traces in a
desired bounded range. Compared with prior work, this scheme is more direct and
flexible, and allows much larger $\lambda$ values for off-policy TD learning
with bounded traces. As to its soundness, using Markov chain theory, we prove
the ergodicity of the joint state-trace process under nonrestrictive
conditions, and we show that associated with our scheme is a generalized
Bellman equation (for the policy to be evaluated) that depends on both the
evolution of $\lambda$ and the unique invariant probability measure of the
state-trace process. These results not only lead immediately to a
characterization of the convergence behavior of the least-squares-based
implementation of our scheme, but also prepare the ground for further analysis
of gradient-based implementations.
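
To make the idea of choosing $\lambda$ from the trace iterates concrete, here is a minimal Python sketch of one simple rule of this kind. It assumes the trace recursion $e_t = \lambda_t \gamma \rho_{t-1} e_{t-1} + \phi(S_t)$ and shrinks $\lambda_t$ only when the carried-over part of the trace would exceed a chosen bound. The specific rule, the function name `bounded_trace_step`, and the parameter `bound` are illustrative assumptions, not the paper's definition.

```python
import math


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def bounded_trace_step(trace, phi, rho_prev, gamma, bound):
    """Pick lambda_t from the current trace, then update the trace.

    trace    : previous eligibility trace e_{t-1}
    phi      : feature vector of the current state S_t
    rho_prev : importance-sampling ratio of the previous transition
    bound    : cap on the norm of the carried-over trace (assumed parameter)
    Returns (lambda_t, e_t) with ||lambda_t * gamma * rho_prev * e_{t-1}|| <= bound.
    """
    carried = [gamma * rho_prev * e for e in trace]
    size = norm(carried)
    # Use lambda = 1 whenever the carried-over trace is already small enough;
    # otherwise scale it back exactly onto the bound.
    lam = 1.0 if size <= bound else bound / size
    new_trace = [lam * c + p for c, p in zip(carried, phi)]
    return lam, new_trace


if __name__ == "__main__":
    trace = [0.0, 0.0]
    phi = [1.0, 0.0]
    # A run of large importance ratios, which would blow up an unbounded trace.
    for rho in [3.0, 3.0, 3.0, 3.0, 3.0]:
        lam, trace = bounded_trace_step(trace, phi, rho, gamma=0.9, bound=2.5)
        print(round(lam, 4), [round(e, 4) for e in trace])
```

Under this rule the trace norm never exceeds `bound` plus the norm of the feature vector, while $\lambda_t$ stays at 1 as long as the trace is small.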

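This paper studies off-policy temporal-difference learning in discounted Markov decision processes. It proposes a new scheme, based on generalized Bellman equations, for setting the $\lambda$-parameters so as to curb variance. Using Markov chain theory, it proves the ergodicity of the joint state-trace process and characterizes the convergence behavior of the least-squares-based implementation of the scheme.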