We study the statistical limits of uniform convergence for offline policy evaluation (OPE) problems (uniform OPE for short) with model-based methods in the episodic MDP setting. Uniform OPE, $\sup_\Pi|Q^\pi-\hat{Q}^\pi|<\epsilon$ (initiated by Yin et al. 2021), is a stronger measure than point-wise (fixed-policy) OPE and ensures offline policy learning when $\Pi$ contains all policies (we call this the global policy class). In this paper, we establish an $\Omega(H^2 S/d_m\epsilon^2)$ lower bound (over the model-based family) for global uniform OPE, where $d_m$ is the minimal state-action probability induced by the behavior policy. The order $S/d_m\epsilon^2$ reveals that the global uniform OPE task is intrinsically harder than offline policy learning, due to the extra factor of $S$. Next, our main result establishes an episode complexity of $\tilde{O}(H^2/d_m\epsilon^2)$ for \emph{local} uniform convergence, which applies to all \emph{near-empirically optimal} policies for MDPs with \emph{stationary} transitions. This result implies the optimal sample complexity for offline learning and separates local uniform OPE from the global case. Most importantly, the model-based method, combined with our new analysis technique (the singleton absorbing MDP), can be adapted to two new settings, offline task-agnostic and offline reward-free learning, with optimal complexities $\tilde{O}(H^2\log(K)/d_m\epsilon^2)$ ($K$ is the number of tasks) and $\tilde{O}(H^2S/d_m\epsilon^2)$ respectively, which provides a unified framework for simultaneously solving different offline RL problems.
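To make the model-based (plug-in) estimator $\hat{Q}^\pi$ concrete, the following is a minimal sketch and not the paper's exact algorithm. It assumes a tabular MDP with a stationary transition kernel, a known reward table, and a deterministic policy; the function names `estimate_transitions` and `evaluate_policy` are hypothetical and chosen for illustration only. Transition counts are pooled across time steps, and $\hat{Q}^\pi$ is then computed by backward induction in the estimated model.

```python
def estimate_transitions(episodes, num_states, num_actions):
    """Plug-in estimate p_hat[s][a][s2] from offline (s, a, s2) transitions.

    Counts are pooled over all time steps, which assumes a stationary kernel.
    Unvisited (s, a) pairs get an all-zero row (an illustrative convention).
    """
    counts = [[[0] * num_states for _ in range(num_actions)]
              for _ in range(num_states)]
    for episode in episodes:
        for s, a, s2 in episode:
            counts[s][a][s2] += 1
    p_hat = []
    for s in range(num_states):
        row = []
        for a in range(num_actions):
            n = sum(counts[s][a])
            row.append([c / n if n > 0 else 0.0 for c in counts[s][a]])
        p_hat.append(row)
    return p_hat


def evaluate_policy(p_hat, reward, policy, horizon):
    """Return q[h][s][a], the plug-in Q-values of `policy` in the estimated MDP.

    reward[s][a] is an assumed known reward table; policy[h][s] is the action
    taken in state s at step h. Values after the final step are zero.
    """
    num_states = len(p_hat)
    num_actions = len(p_hat[0])
    v_next = [0.0] * num_states
    q_all = []
    for h in reversed(range(horizon)):
        q_h = [[reward[s][a]
                + sum(p * v for p, v in zip(p_hat[s][a], v_next))
                for a in range(num_actions)]
               for s in range(num_states)]
        # Value of following the policy from step h onward.
        v_next = [q_h[s][policy[h][s]] for s in range(num_states)]
        q_all.append(q_h)
    q_all.reverse()  # so that q_all[h] corresponds to time step h
    return q_all
```

Uniform OPE asks that the error of this kind of estimate be small simultaneously for every policy in $\Pi$, rather than for one policy fixed in advance.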

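This work uses model-based methods to study a unified framework for offline policy evaluation problems. It provides optimal learning schemes for several offline tasks with solid theoretical support, studies the statistical limits of uniform convergence, and establishes a unified and efficient analysis tool for local uniform convergence.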

Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings