We study high-confidence off-policy evaluation in the context of infinite-horizon Markov decision processes, where the objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies. This task faces two primary challenges: providing comprehensive and rigorous error quantification in CI estimation, and addressing the distributional shift that results from discrepancies between the distribution induced by the target policy and the offline data-generating process. Motivated by an innovative unified error analysis, we jointly quantify two sources of estimation error within a single interval: the misspecification error in modeling marginalized importance weights and the statistical uncertainty due to sampling. This unified framework reveals a previously hidden tradeoff between the two errors, which undermines the tightness of the CI. Relying on a carefully designed discriminator function, the proposed estimator achieves a dual purpose: breaking the curse of the tradeoff to attain the tightest possible CI, and adapting the CI to ensure robustness against distributional shifts. By leveraging a local supermartingale/martingale structure, our method applies to time-dependent data without assuming any weak dependence conditions. Theoretically, we show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings. The numerical performance of the proposed method is examined on synthetic datasets and in an OhioT1DM mobile health study.

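We study high-confidence off-policy evaluation for infinite-horizon Markov decision processes, where the goal is to construct a confidence interval for the target policy value using only offline data pre-collected from unknown behavior policies. Through a novel unified error analysis, we jointly quantify two sources of estimation error, the misspecification error in modeling marginalized importance weights and the statistical uncertainty due to sampling, and reveal a previously hidden tradeoff between them. With a carefully designed discriminator function, the proposed estimator both breaks the limitation imposed by this tradeoff to attain the tightest possible confidence interval and adapts to distributional shift to ensure robustness. By leveraging a local supermartingale/martingale structure, our method applies to time-dependent data without assuming any weak dependence conditions. Theoretically, we show that our algorithm is sample-efficient, error-robust, and provably convergent in non-linear function approximation settings. The numerical performance of the proposed method is validated on synthetic datasets and in the OhioT1DM mobile health study.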

Distributional Shift-Aware Off-Policy Interval Estimation: A Unified Error Quantification Framework