We consider off-policy evaluation (OPE) of deterministic target policies for
reinforcement learning (RL) in environments with continuous action spaces.
While it is common to use importance sampling for OPE, it suffers from high
variance when the behavior policy deviates significantly from the target
policy. To address this issue, recent work on OPE has proposed in-sample
learning with importance resampling. However, these approaches do not apply to
deterministic target policies in continuous action spaces. To address this
limitation, we propose to relax the deterministic target policy using a kernel
and to learn the kernel metric that minimizes the overall mean squared error of
the estimated temporal-difference update vector of the action-value function
used for policy evaluation.
We derive the bias and variance of the estimation error due to this relaxation
and provide analytic solutions for the optimal kernel metric. In empirical
studies on various test domains, we show that OPE with in-sample learning using
the kernel with the optimized metric achieves significantly higher accuracy
than the baselines.
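
As a rough illustration of the kernel relaxation described above, the following
sketch replaces the deterministic target policy with a Gaussian kernel centered
at the target action and uses the resulting weights for importance resampling.
It is not the paper's implementation: the Gaussian kernel form, the function
names, the `bandwidth` parameter, the fixed (not learned) `metric`, and the
assumption that the behavior density is known are all choices made here for
illustration.

```python
import numpy as np


def gaussian_kernel_weights(actions, target_actions, behavior_density, metric, bandwidth):
    """Relaxed importance weights K(a - pi(s)) / mu(a | s).

    Assumption: K is a Gaussian kernel whose shape is set by a positive-definite
    `metric` matrix and a scalar `bandwidth` (both hypothetical parameters).
    """
    diff = actions - target_actions                        # shape (n, d)
    dim = diff.shape[1]
    # Squared distance between logged and target actions under the metric.
    sq_dist = np.einsum("ni,ij,nj->n", diff, metric, diff)
    # Normalizer of a Gaussian with covariance bandwidth^2 * inverse(metric).
    norm = np.sqrt(np.linalg.det(metric)) / ((2.0 * np.pi) ** (dim / 2) * bandwidth ** dim)
    kernel = norm * np.exp(-0.5 * sq_dist / bandwidth ** 2)
    return kernel / behavior_density


def resample_indices(weights, size, rng):
    """Importance resampling: draw indices with probability proportional to weight."""
    probs = weights / weights.sum()
    return rng.choice(len(weights), size=size, replace=True, p=probs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, dim = 1000, 2
    actions = rng.uniform(-1.0, 1.0, size=(n, dim))        # logged behavior actions
    target_actions = np.zeros((n, dim))                    # pi(s) for each logged state
    behavior_density = np.full(n, 0.25)                    # uniform density on [-1, 1]^2
    weights = gaussian_kernel_weights(
        actions, target_actions, behavior_density, metric=np.eye(dim), bandwidth=0.3
    )
    batch = resample_indices(weights, size=64, rng=rng)    # indices for one TD update batch
    print(batch[:10])
```

Transitions whose logged action lies close to the target action under the
metric are drawn more often, so the resampled batch can be used in a TD update
without querying actions outside the data.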

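In continuous action spaces, off-policy evaluation with in-sample learning can achieve significantly higher accuracy by using an optimized kernel metric.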

Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies

Selecting a suitable training dataset is crucial for both general-domain
(e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We
formalize this data selection problem as selecting a subset of a large raw
unlabeled dataset to match a desired target distribution, given some unlabeled
target samples. Due to the large scale and dimensionality of the raw text data,
existing methods use simple heuristics to select data that are similar to a
high-quality reference corpus (e.g., Wikipedia), or leverage experts to
manually curate data. Instead, we extend the classic importance resampling
approach, used in low dimensions, to LM data selection. Crucially, we work in a
reduced feature space to make importance weight estimation tractable over the
space of text. To determine an appropriate feature space, we first show that KL
reduction, a data metric that measures the proximity between selected data and
the target in a feature space, has high correlation with average accuracy on 8
downstream tasks (r=0.89) when computed with simple n-gram features. From this
observation, we present Data Selection with Importance Resampling (DSIR), an
efficient and scalable algorithm that estimates importance weights in a reduced
feature space (e.g., n-gram features in our instantiation) and selects data
with importance resampling according to these weights. When training
general-domain models (target is Wikipedia + books), DSIR improves over random
selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark.
When performing continued pretraining towards a specific domain, DSIR performs
comparably to expert-curated data across 8 target distributions.
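
The selection procedure described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the DSIR implementation:
hashing unigrams and bigrams into a fixed number of buckets, the add-one
smoothing, the bag-of-n-grams model for each distribution, and the Gumbel
top-k trick for resampling without replacement are choices made here, and all
function names are hypothetical.

```python
import zlib

import numpy as np

NUM_BUCKETS = 4096  # assumed size of the reduced feature space


def ngram_counts(text, num_buckets=NUM_BUCKETS):
    """Hash the unigrams and bigrams of `text` into a fixed-size count vector."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = np.zeros(num_buckets)
    for gram in ngrams:
        # crc32 is used because it is deterministic across processes.
        counts[zlib.crc32(gram.encode("utf-8")) % num_buckets] += 1
    return counts


def fit_log_probs(texts, num_buckets=NUM_BUCKETS, smoothing=1.0):
    """Fit a smoothed categorical distribution over hashed n-gram buckets."""
    total = np.full(num_buckets, smoothing)
    for text in texts:
        total += ngram_counts(text, num_buckets)
    return np.log(total / total.sum())


def log_importance_weights(raw_texts, target_texts, num_buckets=NUM_BUCKETS):
    """Log of p_target(x) / p_raw(x) for each raw example, in feature space."""
    log_p = fit_log_probs(target_texts, num_buckets)
    log_q = fit_log_probs(raw_texts, num_buckets)
    feats = np.stack([ngram_counts(t, num_buckets) for t in raw_texts])
    return feats @ (log_p - log_q)


def resample_without_replacement(log_weights, k, rng):
    """Gumbel top-k: sample k distinct indices in proportion to the weights."""
    gumbel = rng.gumbel(size=len(log_weights))
    return np.argsort(-(log_weights + gumbel))[:k]


if __name__ == "__main__":
    target = ["the cat sat on the mat", "the cat ate the fish"]
    raw = [
        "the cat sat on the fish",
        "stock prices fell sharply today",
        "quarterly earnings beat market estimates",
    ]
    log_w = log_importance_weights(raw, target)
    chosen = resample_without_replacement(log_w, k=1, rng=np.random.default_rng(0))
    print([raw[i] for i in chosen])
```

Working with hashed counts keeps the weight computation linear in the size of
the raw dataset, which is the reason the abstract gives for estimating the
weights in a reduced feature space.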

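This paper presents a data selection algorithm based on importance resampling that works in a reduced feature space to select, from a large unlabeled dataset, a subset of samples matching a target distribution. The algorithm significantly improves model performance when training general-domain (e.g., Wikipedia) and domain-specific language models.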