This paper is concerned with the problem of policy evaluation with linear function approximation in discounted infinite horizon Markov decision processes. We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms: the temporal difference (TD) learning algorithm and the two-timescale linear TD with gradient correction (TDC) algorithm. In both the on-policy setting, where observations are generated from the target policy, and the off-policy setting, where samples are drawn from a behavior policy potentially different from the target policy, we establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level. We also exhihit an explicit dependence on problem-related quantities, and show in the on-policy setting that our upper bound matches the minimax lower bound on crucial problem parameters, including the choice of the feature maps and the problem dimension.

本文主要针对利用线性函数逼似模型来评估折扣无限领域MDP中的策略的问题，研究两种广泛使用的政策评估算法（TD和TDC）最佳线性系数的预估误差所需的样本复杂度，提出了一个高可靠性收敛保证的样本复杂度上界，并且在策略内和策略外设置中都达到了最优容差级别依赖，同时，通过显示与问题相关的量，表明在策略内设置中，我们的上界与关键问题参数的Minimax下界相匹配，包括特征映射的选择和问题维数。

使用线性函数逼近进行策略评估的高概率样本复杂度