Reinforcement Learning aims at identifying and evaluating efficient control
policies from data. In many real-world applications, the learner is not allowed
to experiment and cannot gather data in an online manner (this is the case when
experimenting is expensive, risky or unethical). For such applications, the
reward of a given policy (the target policy) must be estimated using historical
data gathered under a different policy (the behavior policy). Most methods for
this learning task, referred to as Off-Policy Evaluation (OPE), do not come
with accuracy and certainty guarantees. We present a novel OPE method based on
Conformal Prediction that outputs an interval containing the true reward of the
target policy with a prescribed level of certainty. The main challenge in OPE
stems from the distribution shift due to the discrepancies between the target
and the behavior policies. We propose and empirically evaluate different ways
to deal with this shift. Some of these methods yield conformalized intervals
with reduced length compared to existing approaches, while maintaining the same
certainty level.
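To make the idea of a conformalized interval concrete, here is a minimal sketch of plain split conformal prediction in Python. It is not the method proposed in the paper: it assumes the calibration data are exchangeable with the test point, so it ignores the distribution shift between behavior and target policies that the abstract identifies as the main challenge. The function name `split_conformal_interval` and its arguments are hypothetical.

```python
import math


def split_conformal_interval(cal_predictions, cal_outcomes, new_prediction, alpha=0.1):
    """Return (lower, upper) around new_prediction with nominal coverage 1 - alpha.

    Assumes calibration pairs are exchangeable with the new point (no policy shift).
    """
    # Nonconformity score: absolute error of each calibration prediction.
    scores = sorted(abs(y - p) for p, y in zip(cal_predictions, cal_outcomes))
    n = len(scores)
    # 1-indexed rank of the conformal quantile among the sorted scores.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_prediction - q, new_prediction + q)
```

Handling the shift between behavior and target policies would require reweighting the calibration scores or a similar correction, which this sketch deliberately leaves out.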

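Proposes an OPE method based on conformal prediction that outputs an interval containing the true reward of the target policy at a prescribed certainty level. The work examines several ways to handle the distribution shift caused by discrepancies between the target and behavior policies, some of which yield shorter intervals than existing approaches while maintaining the same certainty level.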

Conformal Off-Policy Evaluation in Markov Decision Processes

Auxiliary tasks have been argued to be useful for representation learning in
reinforcement learning. Although many auxiliary tasks have been empirically
shown to be effective for accelerating learning on the main task, it is not yet
clear what makes useful auxiliary tasks. Some of the most promising results are
on the pixel control, reward prediction, and the next state prediction
auxiliary tasks; however, the empirical results are mixed, showing substantial
improvements in some cases and marginal improvements in others. A careful
investigation of how auxiliary tasks help the learning of the main task is
necessary. In this paper, we take a step toward studying the effect of the target
policies on the usefulness of the auxiliary tasks formulated as general value
functions. General value functions consist of three core elements: 1) a policy,
2) a cumulant, and 3) a continuation function. Our focus on the role of the target policy
of the auxiliary tasks is motivated by the fact that the target policy
determines the behavior about which the agent wants to make a prediction and
the state-action distribution that the agent is trained on, which further
affects the main task learning. Our study provides insights about questions
such as: Does a greedy policy result in larger performance gains than
other policies? Is it best to set the auxiliary task policy to be the same as
the main task policy? Does the choice of the target policy have a substantial
effect on the achieved performance gain, or do simple strategies for setting the
policy, such as using a uniformly random policy, work as well? Our empirical
results suggest that: 1) Auxiliary tasks with the greedy policy tend to be
useful. 2) Most policies, including a uniformly random policy, tend to improve
over the baseline. 3) Surprisingly, the main task policy tends to be less
useful than other policies.
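The three elements of a general value function (policy, cumulant, continuation function) can be illustrated with a small tabular sketch of one off-policy TD(0) update in Python. This is an assumption-laden illustration, not the paper's setup, which studies auxiliary tasks for representation learning rather than tabular prediction. The function `gvf_td_update` and all of its parameter names are hypothetical.

```python
def gvf_td_update(values, state, action, next_state, cumulant, continuation,
                  target_policy, behavior_policy, step_size=0.1):
    """Apply one importance-weighted TD(0) update to a tabular GVF estimate.

    values:          dict mapping state -> current prediction (missing states count as 0.0)
    cumulant:        function (state, action, next_state) -> signal being accumulated
    continuation:    function (next_state) -> discount in [0, 1]
    target_policy:   function (state, action) -> probability under the GVF's policy
    behavior_policy: function (state, action) -> probability under the data-collecting policy
    """
    # Importance ratio corrects for acting with the behavior policy
    # while predicting about the target policy.
    rho = target_policy(state, action) / behavior_policy(state, action)
    c = cumulant(state, action, next_state)
    gamma = continuation(next_state)
    td_error = c + gamma * values.get(next_state, 0.0) - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + step_size * rho * td_error
    return values[state]
```

Changing `target_policy` (for example to a greedy or a uniformly random policy) changes what is being predicted while the cumulant and continuation function stay fixed, which is the axis of variation the abstract describes.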

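This study investigates how the target policy of auxiliary tasks, used for representation learning in reinforcement learning, affects learning on the main task. The empirical results suggest that auxiliary tasks with a greedy policy tend to be useful, and that most policies, including a uniformly random policy, tend to improve over the baseline. The main task policy tends to be less useful than other policies.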

What makes useful auxiliary tasks in reinforcement learning: investigating the effect of the target policy

Accurately evaluating new policies (e.g. ad-placement models, ranking
functions, recommendation functions) is one of the key prerequisites for
improving interactive systems. While the conventional approach to evaluation
relies on online A/B tests, recent work has shown that counterfactual
estimators can provide an inexpensive and fast alternative, since they can be
applied offline using log data that was collected from a different policy
fielded in the past. In this paper, we address the question of how to estimate
the performance of a new target policy when we have log data from multiple
historic policies. This question is of great relevance in practice, since
policies get updated frequently in most online systems. We show that naively
combining data from multiple logging policies can be highly suboptimal. In
particular, we find that the standard Inverse Propensity Score (IPS) estimator
suffers especially when logging and target policies diverge -- to a point where
throwing away data reduces the variance of the estimator. We therefore propose
two alternative estimators which we characterize theoretically and compare
experimentally. We find that the new estimators can provide substantially
improved estimation accuracy.
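For reference, the standard IPS estimator that the abstract criticizes can be sketched in a few lines of Python. This shows only the naive baseline, where each logged sample is reweighted by the propensity of whichever policy logged it. It does not implement the two alternative estimators the paper proposes. The function name `ips_estimate` and the tuple layout of the log are assumptions made for illustration.

```python
def ips_estimate(logged, target_prob):
    """Naive Inverse Propensity Score estimate of a target policy's expected reward.

    logged:      iterable of (context, action, reward, logging_propensity) tuples,
                 where logging_propensity is the probability with which the policy
                 that logged the sample chose that action (must be > 0)
    target_prob: function (context, action) -> probability under the target policy
    """
    total = 0.0
    count = 0
    for context, action, reward, propensity in logged:
        # Reweight each reward by how much more (or less) likely the target
        # policy is to take the logged action than the logging policy was.
        total += reward * target_prob(context, action) / propensity
        count += 1
    if count == 0:
        raise ValueError("no logged samples")
    return total / count
```

When a logging policy assigns a small propensity to an action the target policy favors, the corresponding weight becomes large, which is one way to see why pooling data from divergent loggers can inflate variance.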

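This paper studies how to use historical log data to estimate the performance of a target policy and proposes two alternative estimators that, compared with the standard approach, can evaluate new policies for interactive systems more accurately.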