We study off-policy evaluation (OPE) in contextual bandit settings with large action spaces. Benchmark estimators face a severe bias-variance tradeoff: parametric approaches suffer from bias because the correct model is difficult to specify, whereas importance-weighting approaches suffer from high variance. To overcome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was proposed to reduce the estimator's variance by using action embeddings. To make the estimator more accurate, we propose a doubly robust counterpart of MIPS, the Marginalized Doubly Robust (MDR) estimator. Theoretical analysis shows that the proposed estimator is unbiased under weaker assumptions than MIPS while retaining the variance reduction relative to IPS, which was the main advantage of MIPS. Empirical experiments confirm that MDR outperforms existing estimators.
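The contrast between IPS and MIPS mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a context-free problem, a deterministic action-to-embedding map, and synthetic rewards that depend on the action only through its embedding. All function and variable names are hypothetical. MDR itself is not sketched because the abstract does not state its form.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()


def ips_estimate(pi_e, pi_0, actions, rewards):
    """Vanilla IPS: weight each reward by pi_e(a) / pi_0(a)."""
    w = pi_e[actions] / pi_0[actions]
    return float(np.mean(w * rewards))


def mips_estimate(pi_e, pi_0, action_to_emb, actions, rewards):
    """Marginalized IPS, assuming a deterministic action -> embedding map."""
    n_emb = int(action_to_emb.max()) + 1
    # Marginal embedding distribution under each policy:
    # p(e | pi) = sum of pi(a) over actions a whose embedding is e.
    p_e = np.bincount(action_to_emb, weights=pi_e, minlength=n_emb)
    p_0 = np.bincount(action_to_emb, weights=pi_0, minlength=n_emb)
    emb = action_to_emb[actions]
    w = p_e[emb] / p_0[emb]  # marginal importance weight
    return float(np.mean(w * rewards))


# Synthetic setup (assumption): many actions, few embedding categories.
rng = np.random.default_rng(0)
n_actions, n_emb, n = 1000, 10, 5000
action_to_emb = rng.integers(n_emb, size=n_actions)

pi_0 = softmax(rng.normal(size=n_actions))  # logging policy
pi_e = softmax(rng.normal(size=n_actions))  # evaluation policy

# Logged data: actions drawn from pi_0, Bernoulli rewards whose mean
# depends on the action only through its embedding.
actions = rng.choice(n_actions, size=n, p=pi_0)
emb_mean = rng.uniform(size=n_emb)
rewards = rng.binomial(1, emb_mean[action_to_emb[actions]]).astype(float)

true_value = float(np.sum(pi_e * emb_mean[action_to_emb]))
ips = ips_estimate(pi_e, pi_0, actions, rewards)
mips = mips_estimate(pi_e, pi_0, action_to_emb, actions, rewards)
print(f"true={true_value:.4f}  IPS={ips:.4f}  MIPS={mips:.4f}")
```

In this sketch, the marginal weights aggregate probability mass over all actions that share an embedding, so they tend to be far less extreme than per-action weights when the action space is large. If every action has its own unique embedding, the two estimators coincide.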

Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces