Off-policy learning, the procedure of optimizing a policy with
access only to logged feedback data, is important in many real-world
applications such as search engines and recommender systems. While the
ground-truth logging policy, which generates the logged data, is usually
unknown, previous work simply uses its estimated value in off-policy learning,
ignoring both the high bias and the high variance resulting from such an estimator,
especially on samples with small and inaccurately estimated logging
probabilities. In this work, we explicitly model the uncertainty in the
estimated logging policy and propose an Uncertainty-aware Inverse Propensity
Score estimator (UIPS) for improved off-policy learning. Experimental results on
synthetic data and three real-world recommendation datasets demonstrate the
advantageous sample efficiency of the proposed UIPS estimator against an
extensive list of state-of-the-art baselines.
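
To make the problem concrete, the following is a minimal sketch of the vanilla IPS estimator that the abstract criticizes, not of UIPS itself. The function name and the toy numbers are illustrative assumptions; the point is only that a small absolute error in a small estimated logging probability doubles the importance weight of that sample.

```python
def ips_estimate(rewards, target_probs, logging_probs):
    """Vanilla IPS estimate of a target policy's value from logged samples.

    Each sample is reweighted by target_prob / logging_prob.
    (Hypothetical helper for illustration; not the UIPS estimator.)
    """
    weights = [t / b for t, b in zip(target_probs, logging_probs)]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


# Three logged samples; the last one has a small true logging probability.
rewards = [1.0, 0.0, 1.0]
target_probs = [0.5, 0.5, 0.5]
true_logging = [0.5, 0.5, 0.02]
# Assumed estimate of the logging policy: off by only 0.01 on the last sample.
estimated_logging = [0.5, 0.5, 0.01]

v_true = ips_estimate(rewards, target_probs, true_logging)       # about 8.67
v_est = ips_estimate(rewards, target_probs, estimated_logging)   # about 17.0
```

The last sample's weight goes from 25 to 50, so the estimate nearly doubles even though the first two propensities are exact.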

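This work explicitly models the uncertainty in the estimated logging policy and proposes an Uncertainty-aware Inverse Propensity Score estimator (UIPS) that improves off-policy learning; experimental results show that it is more sample-efficient than existing methods.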

Uncertainty-Aware Off-Policy Learning

Counterfactual risk minimization is a framework for offline policy
optimization from logged data, where each sample consists of a context, an action, a propensity
score, and a reward. In this work, we build on this
framework and propose a learning method for settings where the rewards for some
samples are not observed, and so the logged data consists of a subset of
samples with unknown rewards and a subset of samples with known rewards. This
setting arises in many application domains, including advertising and
healthcare. While reward feedback is missing for some samples, it is possible
to leverage the unknown-reward samples in order to minimize the risk, and we
refer to this setting as semi-counterfactual risk minimization. To approach
this kind of learning problem, we derive new upper bounds on the true risk
under the inverse propensity score estimator. We then build upon these bounds
to propose a regularized counterfactual risk minimization method, where the
regularization term is based on the logged unknown-reward dataset only and is hence
reward-independent. We also propose another algorithm based on generating
pseudo-rewards for the logged unknown-reward dataset. Experimental results
with neural networks and benchmark datasets indicate that these algorithms can
leverage the logged unknown-reward dataset in addition to the logged known-reward
dataset.
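
The shape of such an objective can be sketched as follows. This is an illustration under stated assumptions, not the method of the paper: the abstract does not specify the regularizer, so the sketch assumes a Monte Carlo estimate of the KL divergence between the logging policy and the learned policy, which needs only propensities and no rewards. All function names, the weight `lam`, and the toy numbers are hypothetical.

```python
import math


def ips_risk(costs, target_probs, logging_probs):
    """IPS estimate of the risk on samples whose reward (here: cost) is known."""
    terms = [c * t / b for c, t, b in zip(costs, target_probs, logging_probs)]
    return sum(terms) / len(terms)


def kl_regularizer(target_probs, logging_probs):
    """Assumed regularizer: Monte Carlo estimate of KL(logging || target).

    It averages log(logging_prob / target_prob) over logged actions,
    so it uses no reward information at all.
    """
    terms = [math.log(b / t) for t, b in zip(target_probs, logging_probs)]
    return sum(terms) / len(terms)


def semi_crm_objective(known, unknown, lam):
    """IPS risk on known-reward samples plus a reward-independent penalty."""
    costs, t_known, b_known = known
    t_unknown, b_unknown = unknown
    return ips_risk(costs, t_known, b_known) + lam * kl_regularizer(t_unknown, b_unknown)


# Toy data: costs are negated rewards, so lower is better.
known = ([-1.0, 0.0], [0.5, 0.25], [0.5, 0.5])
unknown = ([0.5, 0.25], [0.5, 0.5])
objective = semi_crm_objective(known, unknown, lam=0.1)  # about -0.465
```

In a learning loop, `target_probs` would come from the policy being trained, and the objective would be minimized over its parameters.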

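Building on counterfactual risk minimization and the inverse propensity score estimator, this work addresses the setting in which reward feedback is missing for some samples, and proposes a regularized counterfactual risk minimization algorithm and an algorithm based on generating pseudo-rewards.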

Semi-Counterfactual Risk Minimization Via Neural Networks

Accurately evaluating new policies (e.g. ad-placement models, ranking
functions, recommendation functions) is one of the key prerequisites for
improving interactive systems. While the conventional approach to evaluation
relies on online A/B tests, recent work has shown that counterfactual
estimators can provide an inexpensive and fast alternative, since they can be
applied offline using log data that was collected from a different policy
fielded in the past. In this paper, we address the question of how to estimate
the performance of a new target policy when we have log data from multiple
historic policies. This question is of great relevance in practice, since
policies get updated frequently in most online systems. We show that naively
combining data from multiple logging policies can be highly suboptimal. In
particular, we find that the standard Inverse Propensity Score (IPS) estimator
suffers especially when logging and target policies diverge -- to a point where
throwing away data reduces the variance of the estimator. We therefore propose
two alternative estimators which we characterize theoretically and compare
experimentally. We find that the new estimators can provide substantially
improved estimation accuracy.
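
The effect described above can be sketched on toy data. The abstract does not name its two estimators, so the second function below is an assumption: it is a balanced (mixture-propensity) estimator known from the off-policy evaluation literature, shown only as one possible alternative to naive IPS. The sample layout and all names are hypothetical.

```python
def naive_ips(samples):
    """Pool all logs and weight each sample by the logger that produced it."""
    return sum(r * t / probs[k] for r, t, probs, k in samples) / len(samples)


def balanced_ips(samples, n_per_logger):
    """Weight each sample by the data-share-weighted mixture of all loggers.

    Assumed alternative for illustration; not claimed to be either of the
    estimators proposed in the abstract.
    """
    n = sum(n_per_logger)
    shares = [n_j / n for n_j in n_per_logger]
    total = 0.0
    for r, t, probs, _ in samples:
        mixture_prob = sum(s * p for s, p in zip(shares, probs))
        total += r * t / mixture_prob
    return total / n


# Each sample: (reward, target prob of the logged action,
#               prob of that action under each logger, index of the actual logger).
samples = [
    (1.0, 0.8, (0.8, 0.05), 0),
    (1.0, 0.8, (0.8, 0.05), 1),  # logger 1 rarely takes this action
    (0.0, 0.2, (0.2, 0.95), 0),
    (0.0, 0.2, (0.2, 0.95), 1),
]

v_naive = naive_ips(samples)                 # about 4.25, dominated by one weight of 16
v_balanced = balanced_ips(samples, [2, 2])   # about 0.94
```

With rewards in [0, 1], the naive estimate of 4.25 comes almost entirely from the single sample that logger 1 produced with probability 0.05, while the mixture propensity keeps every weight below 2.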

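This paper studies how to use log data from historic policies to estimate the performance of a target policy, and proposes two alternative estimators that evaluate new policies for interactive systems more accurately than the conventional approach.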