Offline policy evaluation (OPE) allows us to estimate a new
sequential decision-making policy's performance by leveraging historical
interaction data collected from other policies. Evaluating a new policy online
without a confident estimate of its performance can lead to costly, unsafe, or
hazardous outcomes, especially in education and healthcare. Several OPE
estimators have been proposed in the last decade, many of which have
hyperparameters and require training. Unfortunately, it is still unclear how to
choose the best OPE algorithm for each task and domain. In this paper, we propose
a new algorithm that uses a statistical procedure to adaptively blend a set of
OPE estimators for a given dataset, without relying on explicit selection. We
prove that our estimator is consistent and satisfies several desirable
properties for policy evaluation. Additionally, we demonstrate that when
compared to alternative approaches, our estimator can be used to select
higher-performing policies in healthcare and robotics. Our work contributes to
the ease of use of a general-purpose, estimator-agnostic off-policy evaluation
framework for offline RL.
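
To make the idea of blending estimators concrete, the following is a minimal sketch, not the procedure proposed in this paper. It assumes a simple stand-in rule: each estimator is weighted by the inverse of its bootstrap variance, and the weights are normalized to sum to one. The function names (`bootstrap_variance`, `blend_estimators`), the toy data, and the two toy estimators are all hypothetical. In practice the estimators would be real OPE methods such as importance sampling or fitted Q evaluation.

```python
import random
import statistics


def bootstrap_variance(estimator, data, n_boot=200, seed=0):
    """Variance of `estimator` across bootstrap resamples of `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample the dataset with replacement and re-run the estimator.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.pvariance(estimates)


def blend_estimators(estimators, data, n_boot=200, seed=0, eps=1e-12):
    """Return (blended_estimate, weights) using inverse-variance weights.

    Assumption: inverse-variance weighting is an illustrative stand-in,
    not the weighting rule derived in the paper.
    """
    variances = [bootstrap_variance(est, data, n_boot, seed) for est in estimators]
    inverse = [1.0 / (v + eps) for v in variances]  # eps guards against zero variance
    total = sum(inverse)
    weights = [x / total for x in inverse]
    point_estimates = [est(data) for est in estimators]
    blended = sum(w * p for w, p in zip(weights, point_estimates))
    return blended, weights


# Hypothetical per-trajectory value estimates (toy numbers, not real results).
toy_data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 1.0, 1.05]

# Two toy "estimators" standing in for real OPE estimators.
toy_estimators = [statistics.mean, statistics.median]

blended_value, blend_weights = blend_estimators(toy_estimators, toy_data)
print(blended_value, blend_weights)
```

Because the weights are non-negative and sum to one, the blended value always lies between the smallest and largest individual estimates. No single estimator is selected outright.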

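Summary: This paper proposes a new algorithm that adaptively blends a set of offline policy evaluation estimators without relying on explicit selection, and proves that the resulting estimator is consistent and satisfies several desirable properties for policy evaluation. It also shows that, compared with alternative approaches, the estimator can be used to select higher-performing policies in healthcare and robotics, contributing to the ease of use of a general-purpose, estimator-agnostic off-policy evaluation framework for offline reinforcement learning.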