This paper addresses the problem of policy selection in domains with abundant
logged data, but with a restricted interaction budget. Solving this problem
would enable safe evaluation and deployment of offline reinforcement learning
policies in industry, robotics, and recommendation domains among others.
Several off-policy evaluation (OPE) techniques have been proposed to assess the
value of policies using only logged data. However, a substantial gap remains
between OPE estimates and full online evaluation, and large amounts of online
interaction are often infeasible in practice. To overcome
this problem, we introduce active offline policy selection, a novel sequential
decision approach that combines logged data with online interaction to identify
the best policy. We use OPE estimates to warm start the online evaluation.
Then, to use the limited environment interactions wisely, we decide
which policy to evaluate next based on a Bayesian optimization method with a
kernel that represents policy similarity. We use multiple benchmarks, including
real-world robotics, with a large number of candidate policies to show that the
proposed approach improves upon state-of-the-art OPE estimates and pure online
policy evaluation.
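
The selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the squared-exponential kernel over policy actions on shared states, the upper-confidence-bound acquisition rule, the noise variances, and all function names (`similarity_kernel`, `gp_posterior`, `active_policy_selection`, `evaluate_online`) are hypothetical choices made for this example.

```python
import numpy as np


def similarity_kernel(policy_actions, length_scale=1.0):
    """Assumed kernel: policies that act alike on shared states are similar.

    policy_actions has shape (K, S, A): the actions of K policies on S states.
    """
    flat = policy_actions.reshape(len(policy_actions), -1)
    sq_dist = ((flat[:, None, :] - flat[None, :, :]) ** 2).mean(axis=-1)
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))


def gp_posterior(kernel, prior_mean, idx, y, noise_var):
    """Gaussian process posterior over the values of all K policies.

    idx, y, noise_var describe noisy observations of individual policies.
    Returns the posterior mean and the posterior variance per policy.
    """
    idx = np.asarray(idx)
    y = np.asarray(y, dtype=float)
    k_oo = kernel[np.ix_(idx, idx)] + np.diag(np.asarray(noise_var, dtype=float))
    k_ao = kernel[:, idx]
    # Solve once for both the mean term and the covariance term.
    sol = np.linalg.solve(k_oo, np.column_stack([y - prior_mean, k_ao.T]))
    mean = prior_mean + k_ao @ sol[:, 0]
    cov = kernel - k_ao @ sol[:, 1:]
    return mean, np.clip(np.diag(cov), 0.0, None)


def active_policy_selection(kernel, ope_estimates, evaluate_online, budget,
                            ope_noise_var=1.0, online_noise_var=0.25, beta=2.0):
    """Warm start with OPE estimates, then spend the online budget via UCB."""
    num_policies = len(ope_estimates)
    prior_mean = float(np.mean(ope_estimates))
    # OPE estimates enter as noisy observations of every policy (warm start).
    idx = list(range(num_policies))
    y = [float(v) for v in ope_estimates]
    noise = [ope_noise_var] * num_policies
    for _ in range(budget):
        mean, var = gp_posterior(kernel, prior_mean, idx, y, noise)
        chosen = int(np.argmax(mean + beta * np.sqrt(var)))  # assumed UCB rule
        idx.append(chosen)
        y.append(float(evaluate_online(chosen)))  # one online episode return
        noise.append(online_noise_var)
    mean, _ = gp_posterior(kernel, prior_mean, idx, y, noise)
    return int(np.argmax(mean)), mean
```

Here `evaluate_online` stands for any callable that runs one episode of the chosen policy and returns its return. Because the kernel correlates similar policies, each online episode also updates the value estimates of policies that were not executed, which is what lets a small interaction budget cover many candidates.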

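This paper proposes a novel sequential decision approach, active offline policy selection, which combines online interaction with logged data and uses Bayesian optimization with a kernel based on policy similarity. Experiments on multiple benchmarks, including real-world robotics, show that the approach improves upon state-of-the-art off-policy evaluation estimates and pure online policy evaluation, addressing the problem of policy selection when online interaction is scarce.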