We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.

本研究旨在通过使用值函数的方差信息提高离线策略评估中的样本效率，其中针对非时变线性马尔可夫决策过程（MDPs），提出了VA-OPE算法，使用值函数的方差对Fitted Q-Iteration中的Bellman残差进行重新加权，并且我们展示了我们的算法比最好已知的结果实现了更紧密的误差界限。我们对行为策略和目标策略之间的分布变化进行了细致的描述，而广泛的数值实验也支持了我们的理论。

线性函数逼近下的方差感知离线评估