In reinforcement learning, off-policy evaluation (OPE) is the problem of
estimating the expected return of an evaluation policy given a fixed dataset
that was collected by running one or more different policies. One of the more
empirically successful algorithms for OPE has been the fitted Q-evaluation
(FQE) algorithm that uses temporal difference updates to learn an action-value
function, which is then used to estimate the expected return of the evaluation
policy. Typically, the original fixed dataset is fed directly into FQE to learn
the action-value function of the evaluation policy. Instead, in this paper, we
seek to enhance the data-efficiency of FQE by first transforming the fixed
dataset using a learned encoder, and then feeding the transformed dataset into
FQE. To learn such an encoder, we introduce an OPE-tailored state-action
behavioral similarity metric, and use this metric and the fixed dataset to
learn an encoder that models this metric. Theoretically, we show that this
metric allows us to bound the error in the resulting OPE estimate. Empirically,
we show that other state-action similarity metrics lead to representations that
cannot represent the action-value function of the evaluation policy, and that
our state-action representation method boosts the data-efficiency of FQE and
lowers OPE error relative to other OPE-based representation learning methods on
challenging OPE tasks. We also empirically show that the learned
representations significantly mitigate divergence of FQE under varying
distribution shifts. Our code is available here:
this https URL
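
As a rough illustration of the FQE step described above (not the authors'
released code), the following is a minimal tabular sketch. It assumes discrete
states and actions, transitions stored as (s, a, r, s_next, done) tuples, and an
evaluation policy given as a table of action probabilities. The names
fitted_q_evaluation and estimate_return and the toy data are hypothetical, and
the learned encoder from the abstract is not modeled here.

```python
import numpy as np


def fitted_q_evaluation(transitions, pi_e, n_states, n_actions,
                        gamma=0.9, n_iters=50):
    """Tabular FQE: repeatedly fit Q(s, a) to r + gamma * E_{a'~pi_e}[Q(s', a')]."""
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        target_sum = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, s_next, done in transitions:
            # Bootstrapped target uses the previous iterate q and the
            # evaluation policy's action distribution at the next state.
            v_next = 0.0 if done else float(pi_e[s_next] @ q[s_next])
            target_sum[s, a] += r + gamma * v_next
            counts[s, a] += 1
        # In the tabular case, the least-squares fit is the mean target per
        # (s, a) pair; pairs absent from the dataset keep their old value.
        seen = counts > 0
        q_new = q.copy()
        q_new[seen] = target_sum[seen] / counts[seen]
        q = q_new
    return q


def estimate_return(q, pi_e, initial_states):
    """OPE estimate: average of E_{a~pi_e}[Q(s0, a)] over initial states."""
    return float(np.mean([pi_e[s0] @ q[s0] for s0 in initial_states]))


# Toy dataset (hypothetical): two states, two actions, logged by some other policy.
transitions = [
    (0, 0, 0.0, 1, True),
    (0, 1, 1.0, 1, False),
    (1, 0, 2.0, 1, True),
    (1, 1, 0.0, 1, True),
]
# Evaluation policy: action 1 in state 0, action 0 in state 1.
pi_e = np.array([[0.0, 1.0], [1.0, 0.0]])

q = fitted_q_evaluation(transitions, pi_e, n_states=2, n_actions=2, gamma=0.5)
value = estimate_return(q, pi_e, initial_states=[0])
print(value)
```

With function approximation, the per-pair mean would be replaced by a
regression step; the method in the abstract would additionally pass each
state-action pair through the learned encoder before that regression.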

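This work introduces an OPE-tailored state-action behavioral similarity metric and uses the fixed dataset to learn this metric in order to improve data efficiency. It shows that the metric bounds the error of the resulting OPE estimate, and shows empirically that the learned representation improves the data efficiency of FQE and lowers OPE error relative to other OPE-based representation learning methods on challenging OPE tasks. The method also significantly mitigates the divergence of FQE under varying distribution shifts.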

Off-Policy Evaluation Based on State-Action Similarity

State-Action Similarity-Based Representations for Off-Policy Evaluation

We are concerned with the problem of hyperparameter selection for fitted
Q-evaluation (FQE). FQE is one of the state-of-the-art methods for offline
policy evaluation (OPE), which is essential to reinforcement learning
without environment simulators. However, like other OPE methods, FQE is not
hyperparameter-free itself, which undermines its utility in real-life
applications. We address this issue by proposing a framework of approximate
hyperparameter selection (AHS) for FQE, which defines a notion of optimality
(called selection criteria) in a quantitative and interpretable manner without
hyperparameters. We then derive four AHS methods, each of which has different
characteristics such as distribution-mismatch tolerance and time complexity. We
also confirm in experiments that the error bound given by the theory matches
empirical observations.

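This work addresses hyperparameter tuning for the FQE algorithm. It proposes a method based on an approximate hyperparameter selection framework, which defines a quantitative and interpretable optimality criterion without requiring hyperparameters, and verifies that the theoretical error bound matches empirical observations.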

Hyperparameter Selection Methods and Error Guarantees for Fitted Q-Evaluation

Hyperparameter Selection Methods for Fitted Q-Evaluation with Error Guarantee

Bootstrapping provides a flexible and effective approach for assessing the
quality of batch reinforcement learning, yet its theoretical properties are less
understood. In this paper, we study the use of bootstrapping in off-policy
evaluation (OPE), and in particular, we focus on fitted Q-evaluation (FQE),
which is known to be minimax-optimal in the tabular and linear-model cases. We
propose a bootstrapping FQE method for inferring the distribution of the policy
evaluation error and show that this method is asymptotically efficient and
distributionally consistent for off-policy statistical inference. To overcome
the computational limits of bootstrapping, we further adapt a subsampling
procedure that improves the runtime by an order of magnitude. We numerically
evaluate the bootstrapping method in classical RL environments for confidence
interval estimation, estimating the variance of an off-policy evaluator, and
estimating the correlation between multiple off-policy evaluators.
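
To make the idea of bootstrapping FQE concrete, here is a minimal sketch under
simplifying assumptions: tabular FQE, resampling of whole episodes with
replacement, and a plain percentile interval. It is not the paper's exact
procedure (in particular, the subsampling speed-up is not shown), and the names
fqe_value and bootstrap_fqe_interval and the toy data are hypothetical.

```python
import numpy as np


def fqe_value(episodes, pi_e, n_states, n_actions, gamma=0.9, n_iters=50):
    """Tabular FQE estimate of the evaluation policy's return from episode starts."""
    transitions = [t for ep in episodes for t in ep]
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        total = np.zeros_like(q)
        count = np.zeros_like(q)
        for s, a, r, s_next, done in transitions:
            v_next = 0.0 if done else float(pi_e[s_next] @ q[s_next])
            total[s, a] += r + gamma * v_next
            count[s, a] += 1
        # Mean target for observed pairs; unobserved pairs keep their old value.
        q = np.where(count > 0, total / np.maximum(count, 1), q)
    starts = [ep[0][0] for ep in episodes]
    return float(np.mean([pi_e[s0] @ q[s0] for s0 in starts]))


def bootstrap_fqe_interval(episodes, pi_e, n_states, n_actions,
                           n_boot=100, alpha=0.1, seed=0, **fqe_kwargs):
    """Resample episodes with replacement, re-run FQE, return (lo, hi, std)."""
    rng = np.random.default_rng(seed)
    n = len(episodes)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # episode indices, with replacement
        resample = [episodes[i] for i in idx]
        estimates.append(
            fqe_value(resample, pi_e, n_states, n_actions, **fqe_kwargs))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi), float(np.std(estimates))


# Toy data (hypothetical): one-step episodes in a single state, logged by a
# uniform behavior policy; action 1 pays a noisy reward, action 0 pays nothing.
data_rng = np.random.default_rng(1)
episodes = []
for _ in range(40):
    a = int(data_rng.integers(0, 2))
    r = float(data_rng.normal(1.0, 0.5)) if a == 1 else 0.0
    episodes.append([(0, a, r, 0, True)])
pi_e = np.array([[0.0, 1.0]])  # evaluation policy always takes action 1

lo, hi, std = bootstrap_fqe_interval(episodes, pi_e, 1, 2, n_iters=5)
print(lo, hi, std)
```

The spread of the resampled estimates gives a variance estimate for the
evaluator, and the same loop could record two evaluators per resample to
estimate their correlation, as the abstract mentions.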

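This paper studies the use of bootstrapping in reinforcement learning and how to improve its computational efficiency, uses the FQE method for policy evaluation, and assesses the potential of bootstrapping in reinforcement learning through numerical experiments.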