Off-policy learning plays a pivotal role in optimizing and evaluating
policies prior to online deployment. During real-time serving, however,
we observe a variety of interventions and constraints that cause inconsistency
between the online and offline settings, which we collectively term runtime
uncertainty. Such uncertainty cannot be learned from the logged data because
it is abnormal and rare by nature. To ensure a certain level of robustness, we
perturb the off-policy estimators along an adversarial direction in view of the
runtime uncertainty. This allows the resulting estimators to be robust not only
to observed but also to unexpected runtime uncertainties. Leveraging this idea, we
bring runtime-uncertainty robustness to three major off-policy learning
methods: the inverse propensity score method, the reward-model method, and the
doubly robust method. We theoretically justify the robustness of our methods to
runtime uncertainty, and demonstrate their effectiveness in both simulation
and real-world online experiments.
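
For orientation, the three base estimators named above can be sketched in their
textbook contextual-bandit form. This is a minimal illustration only: it does
not implement the adversarial perturbation or any runtime-uncertainty handling,
and the function names, the `(context, action, reward, logging propensity)`
log layout, and the toy data are all assumptions made for this sketch.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical log layout: (context, action, reward, logging propensity of that action).
LoggedSample = Tuple[int, int, float, float]
Policy = Callable[[int], Sequence[float]]   # context -> action probabilities
RewardModel = Callable[[int, int], float]   # (context, action) -> predicted reward


def ips_estimate(logs: List[LoggedSample], target_policy: Policy) -> float:
    """Inverse propensity score: reweight each logged reward by pi(a|x) / mu(a|x)."""
    total = 0.0
    for x, a, r, p_log in logs:
        total += target_policy(x)[a] / p_log * r
    return total / len(logs)


def reward_model_estimate(
    logs: List[LoggedSample], target_policy: Policy, reward_model: RewardModel
) -> float:
    """Reward-model method: average the predicted reward under the target policy."""
    total = 0.0
    for x, _a, _r, _p in logs:
        probs = target_policy(x)
        total += sum(probs[b] * reward_model(x, b) for b in range(len(probs)))
    return total / len(logs)


def doubly_robust_estimate(
    logs: List[LoggedSample], target_policy: Policy, reward_model: RewardModel
) -> float:
    """Doubly robust: reward-model baseline plus an IPS-weighted residual correction."""
    total = 0.0
    for x, a, r, p_log in logs:
        probs = target_policy(x)
        baseline = sum(probs[b] * reward_model(x, b) for b in range(len(probs)))
        correction = probs[a] / p_log * (r - reward_model(x, a))
        total += baseline + correction
    return total / len(logs)


# Toy data (assumed): one context, two actions, uniform logging policy.
logs: List[LoggedSample] = [(0, 0, 0.0, 0.5), (0, 1, 1.0, 0.5)]


def target_policy(x: int) -> Sequence[float]:
    return [0.2, 0.8]


def exact_model(x: int, a: int) -> float:
    return float(a)  # matches the toy rewards exactly


def biased_model(x: int, a: int) -> float:
    return 0.5  # deliberately wrong, constant prediction


v_ips = ips_estimate(logs, target_policy)
v_rm = reward_model_estimate(logs, target_policy, exact_model)
v_rm_biased = reward_model_estimate(logs, target_policy, biased_model)
v_dr_biased = doubly_robust_estimate(logs, target_policy, biased_model)
```

On this toy log the target policy's true value is 0.8. The inverse propensity
score and exact reward-model estimates recover it up to floating-point rounding,
the biased reward model alone does not, and the doubly robust estimate still
does because the logging propensities are correct.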

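The paper proposes an offline evaluation method that addresses runtime uncertainty. The method makes the resulting estimators robust not only to observed but also to unexpected runtime uncertainties, and its robustness is demonstrated in both simulation and real-world online experiments.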