Off-policy evaluation (OPE) is crucial for evaluating a target policy's impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions, originally designed for policy learning, in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE. (ii) We derive sufficient conditions for achieving irrelevance in Q-functions and marginalized importance sampling ratios, the latter obtained by constructing a time-reversed Markov decision process (MDP) based on the observed MDP. (iii) We propose a novel two-step procedure that sequentially projects the original state space into a smaller space, which substantially reduces the sample complexity of OPE arising from high cardinality.
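
To make the notion of an irrelevance condition concrete, the following is a minimal Python sketch, not the procedure proposed in this paper. It assumes a hypothetical four-state tabular MDP with a known model (all numbers, and the names `PHI`, `evaluate_policy`, and `abstract_mdp`, are illustrative) and checks one classical condition, model-irrelevance: states merged by the abstraction share rewards and aggregated transition probabilities, and the target policy depends on the state only through the abstraction. Under that assumption, evaluating the policy on the two-state abstract MDP recovers the ground-truth values.

```python
# Illustrative sketch only: a hand-built MDP, not data or code from the paper.
GAMMA = 0.9

# Abstraction phi: ground states {0, 1} -> abstract state 0, {2, 3} -> abstract state 1.
PHI = [0, 0, 1, 1]

# P[s][a][t] = probability of moving from ground state s to t under action a.
# States in the same cell have equal aggregated mass on each abstract state.
P = [
    [[0.1, 0.3, 0.4, 0.2], [0.5, 0.2, 0.0, 0.3]],
    [[0.4, 0.0, 0.1, 0.5], [0.0, 0.7, 0.3, 0.0]],
    [[0.2, 0.0, 0.8, 0.0], [0.5, 0.4, 0.05, 0.05]],
    [[0.1, 0.1, 0.3, 0.5], [0.9, 0.0, 0.1, 0.0]],
]

# R[s][a] = expected reward; identical within each cell of the abstraction.
R = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]

# Target policy pi[s][a]; it depends on s only through PHI[s].
PI = [[0.3, 0.7], [0.3, 0.7], [0.6, 0.4], [0.6, 0.4]]


def evaluate_policy(P, R, pi, gamma=GAMMA, iters=2000):
    """Iterative policy evaluation: repeatedly apply the Bellman operator."""
    n = len(P)
    V = [0.0] * n
    for _ in range(iters):
        V = [
            sum(
                pi[s][a]
                * (R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n)))
                for a in range(len(pi[s]))
            )
            for s in range(n)
        ]
    return V


def abstract_mdp(P, R, pi, phi, n_abs):
    """Build the abstract MDP from one representative ground state per cell.

    Using a single representative is valid only under the model-irrelevance
    assumption stated above; it is not a general-purpose construction.
    """
    reps = {}
    for s, z in enumerate(phi):
        reps.setdefault(z, s)  # first ground state seen in each cell
    P_abs, R_abs, pi_abs = [], [], []
    for z in range(n_abs):
        s = reps[z]
        P_abs.append(
            [
                [sum(p for t, p in enumerate(P[s][a]) if phi[t] == z2)
                 for z2 in range(n_abs)]
                for a in range(len(P[s]))
            ]
        )
        R_abs.append(list(R[s]))
        pi_abs.append(list(pi[s]))
    return P_abs, R_abs, pi_abs


V_ground = evaluate_policy(P, R, PI)
P_abs, R_abs, PI_abs = abstract_mdp(P, R, PI, PHI, n_abs=2)
V_abs = evaluate_policy(P_abs, R_abs, PI_abs)

# Largest gap between each ground value and the value of its abstract state.
max_gap = max(abs(V_ground[s] - V_abs[PHI[s]]) for s in range(len(PHI)))
print("max |V(s) - V_abs(phi(s))| =", max_gap)
```

The abstract problem has two states instead of four, yet yields the same policy values, which is the sense in which a valid abstraction lowers the effective cardinality that OPE has to handle. In practice the model is unknown and the abstraction must be learned from data, which this sketch does not attempt.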

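This study aims to make off-policy evaluation more effective through state abstractions. By constructing a time-reversed MDP based on the observed MDP, it derives sufficient conditions for irrelevance in Q-functions and marginalized importance sampling ratios. It then proposes a novel two-step procedure that sequentially projects the original state space into a smaller space, substantially reducing the sample complexity of OPE arising from high cardinality.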

Forward and Backward State Abstractions for Off-Policy Evaluation