Offline reinforcement learning algorithms often require careful
hyperparameter tuning. Consequently, before deployment, we need to select
amongst a set of candidate policies. As yet, however, there is little
understanding of the fundamental limits of this offline policy selection
(OPS) problem. In this work we aim to provide clarity on when sample-efficient
OPS is possible, primarily by connecting OPS to off-policy policy evaluation
(OPE) and Bellman error (BE) estimation. We first prove a hardness result: by
reducing OPE to OPS, we show that OPS is, in the worst case, just as hard as
OPE. As a result, no OPS method can be more sample efficient than OPE in the
worst case. We then propose a BE method for OPS, called Identifiable BE
Selection (IBES), that has a straightforward method for selecting its own
hyperparameters. We highlight that using IBES for OPS generally has more
requirements than OPE methods but, when those requirements are satisfied, can
be more sample efficient.
We conclude with an empirical study comparing OPE and IBES, and by showing the
difficulty of OPS on an offline Atari benchmark dataset.
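
To make the connection between OPS and OPE concrete, the following is a minimal sketch of the generic "evaluate every candidate, then pick the best" approach to OPS. It is not IBES and is not taken from the paper: the one-step (bandit) setting, the ordinary importance-sampling estimator, and all names (`is_value_estimate`, `select_policy`, the two candidate policies, and the reward probabilities) are assumptions chosen for illustration.

```python
import random


def is_value_estimate(data, target_probs, behavior_probs):
    """Ordinary importance-sampling estimate of a policy's expected reward.

    data: list of (action, reward) pairs logged under the behavior policy.
    target_probs / behavior_probs: action probabilities of each policy.
    """
    total = 0.0
    for action, reward in data:
        weight = target_probs[action] / behavior_probs[action]
        total += weight * reward
    return total / len(data)


def select_policy(data, candidates, behavior_probs):
    """OPS via OPE: estimate every candidate's value and return the argmax."""
    estimates = {
        name: is_value_estimate(data, probs, behavior_probs)
        for name, probs in candidates.items()
    }
    best = max(estimates, key=estimates.get)
    return best, estimates


# Hypothetical logged dataset: a uniform behavior policy over two actions
# with Bernoulli rewards of mean 0.2 and 0.8 (illustrative values only).
rng = random.Random(0)
behavior_probs = [0.5, 0.5]
mean_reward = [0.2, 0.8]
data = []
for _ in range(5000):
    a = rng.choices([0, 1], weights=behavior_probs)[0]
    r = 1.0 if rng.random() < mean_reward[a] else 0.0
    data.append((a, r))

# Two hypothetical candidates; their true values are 0.26 and 0.74.
candidates = {"mostly_arm0": [0.9, 0.1], "mostly_arm1": [0.1, 0.9]}
best, estimates = select_policy(data, candidates, behavior_probs)
print(best, estimates)
```

In this sketch the selection step is only as reliable as the value estimates it relies on, so the data needed for good selection is driven by the data needed for good evaluation.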

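This work studies policy selection in offline reinforcement learning, with a focus on sample efficiency, off-policy policy evaluation, and Bellman error estimation.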

When is Offline Policy Selection Sample Efficient for Reinforcement Learning?

We prove performance guarantees of two algorithms for approximating $Q^\star$
in batch reinforcement learning. Compared to classical iterative methods such
as Fitted Q-Iteration---whose performance loss incurs quadratic dependence on
horizon---these methods estimate (some forms of) the Bellman error and enjoy
linear-in-horizon error propagation, a property established for the first time
for algorithms that rely solely on batch data and output stationary policies.
One of the algorithms uses a novel and explicit importance-weighting correction
to overcome the infamous "double sampling" difficulty in Bellman error
estimation, and does not use any squared losses. Our analyses reveal its
distinct characteristics and potential advantages compared to classical
algorithms.
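
The "double sampling" difficulty mentioned above can be seen in a small worked example. The sketch below does not implement the paper's importance-weighting correction; it only illustrates, on a hypothetical one-step transition, why squaring a single-sample TD error overestimates the squared Bellman error by the variance of the bootstrapped target. All numbers and names (`td_error`, `next_values`, and so on) are assumptions made for illustration.

```python
# Hypothetical setup: one (state, action) pair with reward 0 and two equally
# likely next states whose values max_a' Q(s', a') are +1 and -1.
gamma = 0.9
reward = 0.0
q_sa = 0.0                 # current estimate Q(s, a)
next_values = [1.0, -1.0]
next_probs = [0.5, 0.5]


def td_error(v_next):
    """TD error for one sampled next-state value."""
    return reward + gamma * v_next - q_sa


# Squared Bellman error: average the TD error over next states, then square.
bellman_error = sum(p * td_error(v) for p, v in zip(next_probs, next_values))
squared_bellman_error = bellman_error ** 2

# Naive single-sample objective: square first, then average.
expected_squared_td = sum(
    p * td_error(v) ** 2 for p, v in zip(next_probs, next_values)
)

# Double-sample objective: the product of TD errors computed from two
# independent next-state draws is unbiased for the squared Bellman error.
expected_double_sample = sum(
    p1 * p2 * td_error(v1) * td_error(v2)
    for p1, v1 in zip(next_probs, next_values)
    for p2, v2 in zip(next_probs, next_values)
)

# The gap equals gamma^2 times the variance of the next-state value.
bias = expected_squared_td - squared_bellman_error
print(squared_bellman_error, expected_squared_td, expected_double_sample, bias)
```

Here Q(s, a) already satisfies the Bellman equation, yet the naive objective is strictly positive. Batch data typically contains only one next-state sample per transition, so the double-sample product is not available in practice, which is why a correction of some kind is needed.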

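This paper gives performance guarantees for two algorithms that approximate $Q^\star$ in batch reinforcement learning and compares them with classical iterative methods. By estimating the Bellman error, these algorithms, which rely solely on batch data and output stationary policies, enjoy linear-in-horizon error propagation. One of the algorithms uses a novel and explicit importance-weighting correction to overcome the "double sampling" difficulty in Bellman error estimation, and does not use any squared losses. Our analyses reveal its distinct characteristics and potential advantages compared to classical algorithms.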