This paper is concerned with the problem of policy evaluation with linear
function approximation in discounted infinite-horizon Markov decision
processes. We investigate the sample complexities required to guarantee a
predefined estimation error of the best linear coefficients for two widely used
policy evaluation algorithms: the temporal difference (TD) learning algorithm
and the two-timescale linear TD with gradient correction (TDC) algorithm. In
both the on-policy setting, where observations are generated from the target
policy, and the off-policy setting, where samples are drawn from a behavior
policy potentially different from the target policy, we establish the first
sample complexity bound with a high-probability convergence guarantee that
attains the optimal dependence on the tolerance level. We also exhibit an
explicit dependence on problem-related quantities, and show in the on-policy
setting that our upper bound matches the minimax lower bound on crucial problem
parameters, including the choice of the feature maps and the problem dimension.
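
As a rough illustration of the TD learning algorithm named above, the following minimal sketch runs TD(0) with linear function approximation on a toy chain. The function name `td0_linear`, the two-state deterministic cycle, the one-hot features, and the constant stepsize are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def td0_linear(features, rewards, next_features, gamma, stepsize):
    """One pass of TD(0) with linear value estimates V(s) ~= phi(s) @ theta."""
    theta = np.zeros(features.shape[1])
    for phi, r, phi_next in zip(features, rewards, next_features):
        # Temporal-difference error of the current linear estimate.
        td_error = r + gamma * (phi_next @ theta) - phi @ theta
        # Semi-gradient step along the current feature vector.
        theta = theta + stepsize * td_error * phi
    return theta


# Hypothetical toy chain: deterministic cycle 0 -> 1 -> 0, reward 1 on leaving state 0.
num_steps = 2000
states = np.arange(num_steps) % 2
next_states = (states + 1) % 2
phi = np.eye(2)  # one-hot features, so the best linear fit is the true value function
rewards = np.where(states == 0, 1.0, 0.0)

theta_hat = td0_linear(phi[states], rewards, phi[next_states], gamma=0.5, stepsize=0.1)
print(theta_hat)  # approaches the true values [4/3, 2/3]
```

Because the toy chain is noiseless, the iterate settles on the exact solution. With randomly sampled transitions the estimate would fluctuate, which is the regime the sample complexity bounds address.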

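This paper addresses policy evaluation with linear function approximation in discounted infinite-horizon MDPs. It studies the sample complexity that two widely used policy evaluation algorithms (TD and TDC) require to estimate the best linear coefficients to within a prescribed error, and establishes a sample complexity upper bound with a high-probability convergence guarantee that attains the optimal dependence on the tolerance level in both the on-policy and off-policy settings. By making the dependence on problem-related quantities explicit, it also shows that in the on-policy setting the upper bound matches the minimax lower bound in the key problem parameters, including the choice of feature map and the problem dimension.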

High-probability sample complexity for policy evaluation with linear function approximation

Sharp high-probability sample complexities for policy evaluation with linear function approximation

Robust reinforcement learning (RL) seeks a policy that optimizes the
worst-case performance over an uncertainty set of MDPs. In this paper, we focus
on model-free robust RL, where the uncertainty set is defined to be centered
at a misspecified MDP that generates a single sample trajectory sequentially
and is assumed to be unknown. We develop a sample-based approach to estimate
the unknown uncertainty set and design a robust Q-learning algorithm (tabular
case) and robust TDC algorithm (function approximation setting), which can be
implemented in an online and incremental fashion. For the robust Q-learning
algorithm, we prove that it converges to the optimal robust Q function, and for
the robust TDC algorithm, we prove that it converges asymptotically to some
stationary points. Unlike the results in [Roy et al., 2017], our algorithms do
not need any additional conditions on the discount factor to guarantee the
convergence. We further characterize the finite-time error bounds of the two
algorithms and show that both the robust Q-learning and robust TDC algorithms
converge as fast as their vanilla counterparts (within a constant factor). Our
numerical experiments further demonstrate the robustness of our algorithms. Our
approach can be readily extended to robustify many other algorithms, e.g., TD,
SARSA, and other GTD algorithms.
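
The abstract does not spell out the uncertainty set or the update rule, so the sketch below is illustrative only. It assumes an R-contamination uncertainty set (the nominal transition kernel mixed, with weight `R`, with an arbitrary distribution over states) and a reward-maximizing agent, in which case the worst case puts the adversarial mass on the lowest-valued state. The function name `robust_q_update`, the parameter `R`, and the toy numbers are hypothetical.

```python
import numpy as np


def robust_q_update(Q, s, a, r, s_next, gamma, stepsize, R):
    """One robust Q-learning step under an assumed R-contamination uncertainty set."""
    values = Q.max(axis=1)      # greedy value of every state
    worst_value = values.min()  # adversarial mass R is placed on the worst state
    # Robust target: nominal next-state value mixed with the worst-case value.
    target = r + gamma * ((1.0 - R) * values[s_next] + R * worst_value)
    Q = Q.copy()                # leave the caller's table untouched
    Q[s, a] += stepsize * (target - Q[s, a])
    return Q


# Hypothetical 2-state, 2-action table and a single observed transition.
Q0 = np.array([[0.0, 0.0], [1.0, 3.0]])
Q_robust = robust_q_update(Q0, s=0, a=0, r=1.0, s_next=1, gamma=0.5, stepsize=1.0, R=0.5)
Q_plain = robust_q_update(Q0, s=0, a=0, r=1.0, s_next=1, gamma=0.5, stepsize=1.0, R=0.0)
print(Q_robust[0, 0], Q_plain[0, 0])
```

Setting `R=0.0` recovers the vanilla Q-learning target, so the robust update costs only one extra minimum over state values per step in this sketch.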

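This paper proposes a sample-based approach to estimate the unknown uncertainty set and designs a robust Q-learning algorithm and a robust TDC algorithm, both of which can be implemented online and incrementally. Without requiring additional conditions on the discount factor to guarantee convergence, it proves that the robust Q-learning algorithm converges to the optimal robust Q function and that the robust TDC algorithm converges asymptotically to some stationary points. Numerical experiments further demonstrate the robustness of the algorithms.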

Online robust reinforcement learning with model uncertainty

Online Robust Reinforcement Learning with Model Uncertainty

Gradient-based temporal difference (GTD) algorithms are widely used in
off-policy learning scenarios. Among them, the two time-scale TD with gradient
correction (TDC) algorithm has been shown to have superior performance. In
contrast to previous studies that characterized the non-asymptotic convergence
rate of TDC only under independent and identically distributed (i.i.d.) data
samples, we provide the first non-asymptotic convergence analysis for two
time-scale TDC under a non-i.i.d. Markovian sample path and linear function
approximation. We show that the two time-scale TDC can converge as fast as
O(log t / t^(2/3)) under diminishing stepsize, and can converge exponentially
fast under constant stepsize, but at the cost of a non-vanishing error. We
further propose a TDC algorithm with blockwise diminishing stepsize, and show
that it asymptotically converges with an arbitrarily small error at a
blockwise linear convergence rate. Our experiments demonstrate that such an
algorithm converges as fast as TDC under constant stepsize, and still enjoys
accuracy comparable to that of TDC under diminishing stepsize.
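
For orientation, here is a minimal sketch of a two time-scale TDC update with linear function approximation and constant stepsizes. It follows one commonly stated form of the TDC recursion. The function name `tdc_linear`, the stepsizes `alpha` and `beta`, the toy two-state cycle, and the unit importance ratios (on-policy data) are assumptions for this example rather than the paper's setup, and the blockwise diminishing schedule is not implemented here.

```python
import numpy as np


def tdc_linear(features, rewards, next_features, rhos, gamma, alpha, beta):
    """Two time-scale TDC with constant stepsizes alpha (slow) and beta (fast)."""
    dim = features.shape[1]
    theta = np.zeros(dim)  # main value-function parameters
    w = np.zeros(dim)      # auxiliary parameters used by the gradient correction
    for phi, r, phi_next, rho in zip(features, rewards, next_features, rhos):
        td_error = r + gamma * (phi_next @ theta) - phi @ theta
        phi_w = phi @ w
        # TD step plus the gradient-correction term -gamma * phi' * (phi^T w).
        theta = theta + alpha * rho * (td_error * phi - gamma * phi_w * phi_next)
        # Fast iterate: least-squares style fit of the TD error on the features.
        w = w + beta * (rho * td_error - phi_w) * phi
    return theta, w


# Hypothetical toy chain: deterministic cycle 0 -> 1 -> 0, reward 1 on leaving state 0.
num_steps = 5000
states = np.arange(num_steps) % 2
next_states = (states + 1) % 2
phi = np.eye(2)                # one-hot features
rewards = np.where(states == 0, 1.0, 0.0)
rhos = np.ones(num_steps)      # importance ratios equal to 1, i.e. on-policy data

theta_tdc, w_tdc = tdc_linear(
    phi[states], rewards, phi[next_states], rhos, gamma=0.5, alpha=0.05, beta=0.2
)
print(theta_tdc, w_tdc)  # theta approaches [4/3, 2/3]; w approaches 0
```

The toy chain is noiseless, so the constant stepsize leaves no residual error here. With a noisy Markovian sample path, a constant stepsize would be expected to trade fast convergence for a non-vanishing error, as the abstract describes.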

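This paper gives a non-asymptotic convergence analysis of the two time-scale TDC algorithm under a non-i.i.d. Markovian sample path and linear function approximation, and on that basis proposes a TDC algorithm with blockwise diminishing stepsize. Experiments show that it converges as fast as TDC with a constant stepsize while retaining accuracy comparable to that of TDC with a diminishing stepsize.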