Evaluating policies using off-policy data is crucial for applying reinforcement learning to real-world problems such as healthcare and autonomous driving. Previous methods for off-policy evaluation (OPE) generally suffer from high variance or irreducible bias, leading to unacceptably high prediction errors. In this work, we introduce STAR, a framework for OPE that encompasses a broad range of estimators -- which include existing OPE methods as special cases -- that achieve lower mean squared prediction errors. STAR leverages state abstraction to distill complex, potentially continuous problems into compact, discrete models which we call abstract reward processes (ARPs). Predictions from ARPs estimated from off-policy data are provably consistent (asymptotically correct). Rather than proposing a specific estimator, we present a new framework for OPE and empirically demonstrate that estimators within STAR outperform existing methods. The best STAR estimator outperforms baselines in all twelve cases studied, and even the median STAR estimator surpasses the baselines in seven out of the twelve cases.
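To make the idea of an abstract reward process concrete, the following is a minimal sketch, not the estimator defined in the paper. It assumes that states have already been mapped to integer abstract-state indices, that each logged step records the action probabilities under the evaluation and behavior policies, and that transitions are weighted by the cumulative importance ratio up to that step. The function names `estimate_arp` and `arp_value` and the data layout are hypothetical.

```python
import numpy as np

def estimate_arp(episodes, n_abstract):
    """Estimate a tabular reward process over abstract states.

    episodes: list of episodes; each step is a tuple
        (z, r, z_next, pi_e, pi_b) with abstract state z, reward r,
        next abstract state z_next, and the probabilities of the logged
        action under the evaluation (pi_e) and behavior (pi_b) policies.
    Assumption: each step is weighted by the cumulative importance ratio.
    """
    trans = np.zeros((n_abstract, n_abstract))  # weighted transition counts
    rew = np.zeros(n_abstract)                  # weighted reward sums
    mass = np.zeros(n_abstract)                 # total weight per abstract state
    d0 = np.zeros(n_abstract)                   # initial abstract-state counts
    for episode in episodes:
        d0[episode[0][0]] += 1.0
        rho = 1.0
        for z, r, z_next, pi_e, pi_b in episode:
            rho *= pi_e / pi_b
            trans[z, z_next] += rho
            rew[z] += rho * r
            mass[z] += rho
    visited = mass > 0
    P = np.zeros_like(trans)
    R = np.zeros_like(rew)
    # Normalize only rows with positive weight; unvisited rows stay zero.
    P[visited] = trans[visited] / mass[visited][:, None]
    R[visited] = rew[visited] / mass[visited]
    return P, R, d0 / d0.sum()

def arp_value(P, R, d0, horizon, gamma=1.0):
    """Roll the abstract state distribution forward and sum expected rewards."""
    d, value = d0.copy(), 0.0
    for t in range(horizon):
        value += (gamma ** t) * float(d @ R)
        d = d @ P
    return value

# Toy one-step problem: abstract state 0 is the start, 1 is terminal.
# The behavior policy is uniform over two actions; the evaluation policy
# picks the rewarding action with probability 0.8.
episodes = [
    [(0, 1.0, 1, 0.8, 0.5)],  # rewarding action logged
    [(0, 0.0, 1, 0.2, 0.5)],  # non-rewarding action logged
]
P, R, d0 = estimate_arp(episodes, n_abstract=2)
value = arp_value(P, R, d0, horizon=1)
```

In this toy case the self-normalized weights recover the evaluation policy's expected reward in the start state. How the abstraction is chosen and how the importance weights are truncated are design decisions this sketch does not address.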

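This work addresses the high variance and irreducible bias that affect policy evaluation from off-policy data and lead to excessively high prediction errors. The proposed STAR framework uses state abstraction to distill complex problems into compact, discrete models, yielding estimators with lower mean squared prediction error; it is a new framework for off-policy evaluation rather than a single estimator. Empirically, the best STAR estimator outperforms the baselines in all twelve cases studied, and the median STAR estimator does so in seven of the twelve.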

Abstract Reward Processes: Leveraging State Abstraction for Consistent Off-Policy Evaluation