We introduce a novel doubly-robust (DR) off-policy evaluation (OPE) estimator
for Markov decision processes, DRUnknown, designed for situations where both
the logging policy and the value function are unknown. The proposed estimator
first estimates the logging policy and then fits the value function model by
minimizing the asymptotic variance of the estimator while accounting for the
effect of estimating the logging policy. When the logging policy model is
correctly specified, DRUnknown achieves the smallest asymptotic variance within
the class containing existing OPE estimators. When the value function model is
also correctly specified, DRUnknown is optimal as its asymptotic variance
reaches the semiparametric lower bound. We present experimental results
conducted in contextual bandits and reinforcement learning to compare the
performance of DRUnknown with that of existing methods.
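
The two-stage recipe described above (estimate the logging policy, then plug it into a doubly-robust estimate) can be illustrated with a minimal Python sketch on a toy contextual bandit. This is not the DRUnknown estimator itself: the reward model here is a deliberately crude global mean rather than one fitted by minimizing asymptotic variance. The function names, the tabular frequency estimate of the logging policy, and the simulated environment are all assumptions made for this example.

```python
import random
from collections import Counter

ACTIONS = (0, 1)

def simulate_logs(n, seed=0):
    """Toy environment (assumed): binary context, binary action, Bernoulli reward."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        x = rng.randint(0, 1)
        # Logging policy, unknown to the estimator: prefers action == context.
        a = x if rng.random() < 0.7 else 1 - x
        # Reward is 1 with probability 0.8 when the action matches the context.
        r = 1.0 if rng.random() < (0.8 if a == x else 0.2) else 0.0
        logs.append((x, a, r))
    return logs

def target_policy(a, x):
    """Evaluation policy (assumed): always play the action equal to the context."""
    return 1.0 if a == x else 0.0

def doubly_robust_value(logs, policy, actions=ACTIONS):
    n = len(logs)
    # Step 1: estimate the logging policy from action frequencies per context.
    ctx_counts = Counter(x for x, _, _ in logs)
    pair_counts = Counter((x, a) for x, a, _ in logs)

    def pi0_hat(a, x):
        return pair_counts[(x, a)] / ctx_counts[x]

    # Step 2: a deliberately misspecified reward model (global mean reward).
    q_hat = sum(r for _, _, r in logs) / n

    total = 0.0
    for x, a, r in logs:
        # Direct-method term plus importance-weighted residual correction.
        direct = sum(policy(b, x) * q_hat for b in actions)
        weight = policy(a, x) / pi0_hat(a, x)
        total += direct + weight * (r - q_hat)
    return total / n

if __name__ == "__main__":
    print(doubly_robust_value(simulate_logs(20000), target_policy))
```

In this toy environment the target policy's true value is 0.8 by construction, so the estimate should land near that value even though the reward model is wrong. This is the robustness that a correctly specified logging policy model provides.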

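This study introduces a new doubly-robust off-policy evaluation (OPE) estimator for settings where both the logging policy and the value function are unknown; it attains the smallest asymptotic variance, which reaches the semiparametric lower bound when both models are correctly specified.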

Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy

Off-policy learning, the procedure of policy optimization with access only
to logged feedback data, has proven important in various real-world
applications such as search engines and recommender systems. While the
ground-truth logging policy, which generates the logged data, is usually
unknown, previous work simply plugs in an estimate of it for off-policy learning,
ignoring the high bias and high variance resulting from such an estimator,
especially on samples with small and inaccurately estimated logging
probabilities. In this work, we explicitly model the uncertainty in the
estimated logging policy and propose an Uncertainty-aware Inverse Propensity
Score estimator (UIPS) for improved off-policy learning. Experimental results on
synthetic data and three real-world recommendation datasets demonstrate that
the proposed UIPS estimator is more sample-efficient than an extensive list of
state-of-the-art baselines.
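
To make the problem with small estimated logging probabilities concrete, here is a minimal Python sketch of a plain inverse propensity score (IPS) estimate that plugs in an estimated logging policy. It is not UIPS: the optional weight clipping shown is a generic, commonly used remedy swapped in for illustration only. The function names, the two-sample log, and the probability values are hypothetical.

```python
def ips_value(logs, target_prob, logging_prob_hat, clip=None):
    """IPS estimate of a target policy's value using estimated logging probabilities.

    logs: list of (context, action, reward) tuples.
    clip: optional upper bound on the importance weight (generic remedy, not UIPS).
    """
    total = 0.0
    for x, a, r in logs:
        weight = target_prob(a, x) / logging_prob_hat(a, x)
        if clip is not None:
            weight = min(weight, clip)
        total += weight * r
    return total / len(logs)

# Hypothetical log: one user context "u", two logged actions, both rewarded.
LOGS = [("u", "a", 1.0), ("u", "b", 1.0)]

# Hypothetical estimated logging probabilities; action "b" has a tiny estimate.
EST_LOGGING = {"a": 0.5, "b": 0.01}

def target_prob(a, x):
    # Hypothetical target policy: uniform over the two actions.
    return 0.5

def logging_prob_hat(a, x):
    return EST_LOGGING[a]

if __name__ == "__main__":
    print(ips_value(LOGS, target_prob, logging_prob_hat))           # unclipped
    print(ips_value(LOGS, target_prob, logging_prob_hat, clip=10))  # clipped
```

The single sample with estimated logging probability 0.01 receives an importance weight of 50 and dominates the unclipped estimate, so any error in that small probability is amplified. This is the sensitivity the abstract attributes to small, inaccurately estimated logging probabilities.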

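By explicitly modeling the uncertainty in the estimated logging policy, this work proposes an Uncertainty-aware Inverse Propensity Score estimator (UIPS) that improves off-policy learning; experimental results show that it is more sample-efficient than existing methods.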