Offline policy evaluation (OPE) lets us estimate the performance of a new
sequential decision-making policy from historical interaction data collected
by other policies. Evaluating a new policy online
without a confident estimate of its performance can lead to costly, unsafe, or
hazardous outcomes, especially in education and healthcare. Several OPE
estimators have been proposed in the last decade, many of which have
hyperparameters and require training. Unfortunately, it remains unclear how to
choose the best OPE algorithm for each task and domain. In this paper, we propose
a new algorithm that uses a statistical procedure to adaptively blend a set of
OPE estimators for a given dataset, without relying on explicit selection. We
prove that our estimator is consistent and satisfies several desirable
properties for policy evaluation. Additionally, we demonstrate that when
compared to alternative approaches, our estimator can be used to select
higher-performing policies in healthcare and robotics. Our work contributes to
improving ease of use for a general-purpose, estimator-agnostic, off-policy
evaluation framework for offline RL.
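
To make the idea of blending estimators concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract does not spell out the statistical procedure, so this sketch assumes a bootstrap is used to approximate each estimator's error and that the weights minimize the resulting quadratic form subject to summing to one. The function names, the ridge term, and the synthetic data are all hypothetical.

```python
import numpy as np

def is_estimate(rho, ret):
    """Ordinary importance sampling: mean of importance-weighted returns."""
    return float(np.mean(rho * ret))

def wis_estimate(rho, ret):
    """Weighted (self-normalized) importance sampling."""
    return float(np.sum(rho * ret) / np.sum(rho))

def blend_estimators(rho, ret, estimators, n_boot=200, ridge=1e-8, seed=0):
    """Blend several OPE estimates with weights that sum to one.

    Assumption: the deviation of each bootstrap estimate from the
    full-data estimate stands in for that estimator's unknown error.
    """
    rng = np.random.default_rng(seed)
    n = len(ret)
    full = np.array([est(rho, ret) for est in estimators])

    # Bootstrap: resample trajectories with replacement and re-estimate.
    devs = np.empty((n_boot, len(estimators)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boot = np.array([est(rho[idx], ret[idx]) for est in estimators])
        devs[b] = boot - full

    # Estimated second-moment matrix of the errors (ridge keeps it invertible).
    A = devs.T @ devs / n_boot + ridge * np.eye(len(estimators))

    # Minimize w^T A w subject to sum(w) = 1: w is proportional to A^{-1} 1.
    w = np.linalg.solve(A, np.ones(len(estimators)))
    w = w / w.sum()
    return w, float(w @ full)

# Synthetic data (hypothetical): per-trajectory importance ratios and returns.
rng = np.random.default_rng(1)
rho = np.exp(rng.normal(0.0, 0.5, size=500))
ret = rng.normal(1.0, 1.0, size=500)

weights, blended = blend_estimators(rho, ret, [is_estimate, wis_estimate])
print("weights:", weights, "blended estimate:", blended)
```

The weights in this sketch are unconstrained apart from summing to one, so they can be negative when two estimators are highly correlated; a practical variant might add bounds or a simplex constraint.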

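Proposes a new algorithm that adaptively blends a set of offline policy evaluation estimators without relying on explicit selection, and proves that the resulting estimator is consistent and has several desirable properties for policy evaluation. It also shows that, compared with alternative approaches, the estimator can select higher-performing policies in healthcare and robotics, contributing to the ease of use of a general-purpose, estimator-agnostic off-policy evaluation framework for offline RL.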

OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators

This paper introduces SCOPE-RL, a comprehensive open-source Python software
package designed for offline reinforcement learning (offline RL), off-policy
evaluation (OPE), and off-policy selection (OPS). Unlike most existing libraries that focus solely on
either policy learning or evaluation, SCOPE-RL seamlessly integrates these two
key aspects, facilitating flexible and complete implementations of both offline
RL and OPE processes. SCOPE-RL puts particular emphasis on its OPE modules,
offering a range of OPE estimators and robust evaluation-of-OPE protocols. This
approach enables more in-depth and reliable OPE compared to other packages. For
instance, SCOPE-RL enhances OPE by estimating the entire reward distribution
under a policy rather than its mere point-wise expected value. Additionally,
SCOPE-RL provides a more thorough evaluation-of-OPE by presenting the
risk-return tradeoff in OPE results, extending beyond mere accuracy evaluations
in existing OPE literature. SCOPE-RL is designed with user accessibility in
mind. Its user-friendly APIs, comprehensive documentation, and a variety of
easy-to-follow examples assist researchers and practitioners in efficiently
implementing and experimenting with various offline RL methods and OPE
estimators, tailored to their specific problem contexts. The documentation of
SCOPE-RL is available at this https URL
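
To illustrate what estimating the whole reward distribution adds over a point-wise expected value, here is a minimal NumPy sketch. It does not use SCOPE-RL's API; the function names are hypothetical, and it assumes per-trajectory importance ratios and returns are already available. It builds a self-normalized importance-sampling estimate of the return CDF, then reads off both the mean and a lower-tail risk measure (CVaR) from the same estimate.

```python
import numpy as np

def snis_cdf(rho, ret):
    """Self-normalized importance-sampling estimate of the return CDF.

    Returns the sorted returns, their normalized weights, and the
    cumulative weights (the estimated CDF at each sorted return).
    """
    order = np.argsort(ret)
    r = ret[order]
    w = rho[order] / rho.sum()
    return r, w, np.cumsum(w)

def mean_and_cvar(rho, ret, alpha=0.1):
    """Mean and lower-tail CVaR at level alpha from the estimated CDF."""
    r, w, cdf = snis_cdf(rho, ret)
    mean = float(np.sum(w * r))

    # Probability mass of each point that lies inside the worst alpha tail.
    mass_before = cdf - w
    tail_mass = np.clip(alpha - mass_before, 0.0, w)
    cvar = float(np.sum(tail_mass * r) / alpha)
    return mean, cvar

# Toy data (hypothetical): uniform importance ratios, four observed returns.
rho = np.ones(4)
ret = np.array([3.0, 1.0, 4.0, 2.0])

mean, cvar = mean_and_cvar(rho, ret, alpha=0.5)
print(mean, cvar)  # 2.5 1.5
```

Two policies with the same estimated mean can differ sharply in their lower tail, which is why a distributional estimate can matter in safety-sensitive settings.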

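SCOPE-RL is a comprehensive open-source Python software package for offline reinforcement learning (offline RL), off-policy evaluation (OPE), and off-policy selection (OPS). By integrating the two key aspects of policy learning and evaluation, it provides flexible and complete implementations of offline RL and OPE processes. It places particular emphasis on its OPE modules, offering a range of OPE estimators and robust evaluation-of-OPE protocols.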