We consider the off-policy evaluation problem of reinforcement learning using deep neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high representation dimensionality. Specifically, we establish a sharp error bound for the fitted Q-evaluation that depends on the intrinsic low dimension, the smoothness of the state-action space, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' {\it mismatch in the function space}, which can be small even if the two policies are not close to each other in their tabular forms. Numerical experiments are provided to support our theoretical analysis.
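
As a rough illustration of the fitted Q-evaluation iteration referred to above, one common discounted-reward formulation is sketched here. The notation is assumed for illustration and does not come from the abstract: $\mathcal{F}$ is the neural network class, $\{(s_i, a_i, r_i, s_i')\}_{i=1}^{n}$ are transitions collected under the behavior policy, $\pi$ is the target policy, $\gamma \in [0,1)$ is a discount factor, $\xi$ is the initial state distribution, and $K$ is the number of iterations, starting from $\widehat{Q}_0 \equiv 0$. The setting actually analyzed (for example, a finite horizon without discounting) may differ.

\[
\widehat{Q}_{k+1} \in \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \Bigl( f(s_i, a_i) - r_i - \gamma \int \widehat{Q}_k(s_i', a)\, \pi(a \mid s_i')\, \mathrm{d}a \Bigr)^{2}, \qquad k = 0, \dots, K-1,
\]
\[
\widehat{v}^{\pi} = \int \!\! \int \widehat{Q}_K(s, a)\, \pi(a \mid s)\, \mathrm{d}a \, \mathrm{d}\xi(s).
\]

Each step is an ordinary least-squares regression over $\mathcal{F}$ onto a bootstrapped target, which is why the size of the network class and the mismatch between the behavior and target policies, as measured within that class, would both be expected to enter an error bound.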

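This work analyzes the off-policy evaluation problem in reinforcement learning using deep convolutional neural networks. It finds that, by choosing the network size appropriately, one can exploit any low-dimensional manifold structure in the Markov decision process to obtain an efficient estimator. The work also proposes a new approximation algorithm and validates the theoretical analysis with numerical experiments.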

Sample Complexity Analysis of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds Using Deep Networks