Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation. However, current OPE methods, such as Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators, suffer from high variance, particularly in cases of low overlap between target and behavior policies or large action and context spaces. In this paper, we introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves. Through rigorous theoretical analysis, we demonstrate the benefits of the MR estimator compared to conventional methods like IPW and DR in terms of variance reduction. Additionally, we establish a connection between the MR estimator and the state-of-the-art Marginalized Inverse Propensity Score (MIPS) estimator, proving that MR achieves lower variance among a generalized family of MIPS estimators. We further illustrate the utility of the MR estimator in causal inference settings, where it exhibits enhanced performance in estimating Average Treatment Effects (ATE). Our experiments on synthetic and real-world datasets corroborate our theoretical findings and highlight the practical advantages of the MR estimator in OPE for contextual bandits.

在本文中，我们介绍了一种新的基于边际比率的 Off-Policy Evaluation (OPE) 估计器，用于 contextual bandits，旨在通过关注结果边际分布的变化来减少方差。我们通过严格的理论分析证明了 MR 估计器相对于传统方法（如 IPW 和 DR）在方差减小方面的优势。此外，我们还验证了 MR 估计器与最先进的 Marginalized Inverse Propensity Score (MIPS) 估计器之间的联系，并证明 MR 在广义 MIPS 估计器家族中具有更低的方差。我们的实验结果在合成数据集和真实世界数据集上验证了我们的理论发现，并突出了 MR 估计器在 contextual bandits 的 OPE 中的实际优势，特别是在因果推断设置中对于估计平均处理效应方面的性能提升。

在情境强化学习中进行的离线策略评估的边际密度比