Many applications of Reinforcement Learning (RL) have noise or
stochasticity present in the environment. Beyond their impact on learning,
these uncertainties lead the exact same policy to perform differently, i.e.
yield different returns, from one roll-out to another. Common evaluation
procedures in RL summarise the consequent return distributions using solely the
expected return, which does not account for the spread of the distribution. Our
work defines this spread as the policy reproducibility: the ability of a policy
to obtain similar performance when rolled out many times, a crucial property in
some real-world applications. We highlight that existing procedures that only
use the expected return are limited on two fronts: first, an infinite number of
return distributions with a wide range of performance-reproducibility
trade-offs can have the same expected return, which limits the usefulness of
the expected return for comparing policies; second, the expected return does
not leave any room for practitioners to choose the best trade-off value for the
application at hand. In this work, we address these limitations by recommending
the use of the Lower Confidence Bound (LCB), a metric taken from Bayesian
optimisation that
provides the user with a preference parameter to choose a desired
performance-reproducibility trade-off. We also formalise and quantify policy
reproducibility, and demonstrate the benefit of our metrics through extensive
experiments with popular RL algorithms on common RL tasks with uncertainty.
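
To make the trade-off concrete, here is a minimal sketch of a lower-confidence-bound style score over roll-out returns. The abstract does not state the exact formula, so this sketch assumes the form common in Bayesian optimisation: the mean minus a preference parameter times a spread estimate. The use of the sample standard deviation as the spread, the name `alpha`, and the return values are all illustrative assumptions, not the authors' definitions.

```python
import statistics


def lower_confidence_bound(returns, alpha=1.0):
    """Score a policy as mean(returns) - alpha * spread(returns).

    Assumption: spread is the sample standard deviation; `alpha` is a
    hypothetical name for the user's preference parameter.
    """
    mean = statistics.fmean(returns)
    # A single roll-out carries no spread information, so treat it as 0.
    spread = statistics.stdev(returns) if len(returns) > 1 else 0.0
    return mean - alpha * spread


# Hypothetical returns from rolling out two policies five times each.
# Both have the same expected return but very different spreads.
steady_policy = [9.0, 10.0, 11.0, 10.0, 10.0]
erratic_policy = [0.0, 20.0, 5.0, 15.0, 10.0]

for alpha in (0.0, 1.0, 2.0):
    steady = lower_confidence_bound(steady_policy, alpha)
    erratic = lower_confidence_bound(erratic_policy, alpha)
    print(f"alpha={alpha}: steady={steady:.2f}, erratic={erratic:.2f}")
```

With `alpha=0` the score reduces to the expected return and cannot separate the two policies; any positive `alpha` ranks the policy with the narrower return distribution higher.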

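Research shows that noise and stochasticity are present in reinforcement learning, and that existing evaluation procedures assess policies using only the expected return, which limits their effectiveness for comparing policies and for choosing the best trade-off value. This work recommends the Lower Confidence Bound metric from Bayesian optimisation, which gives users a parameter for selecting the desired performance-reproducibility trade-off, and validates the benefit of these metrics through extensive experiments.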

Beyond Expected Return: Accounting for Policy Reproducibility when Evaluating Reinforcement Learning Algorithms

Offline (or batch) reinforcement learning (RL) algorithms seek to learn an
optimal policy from a fixed dataset without active data collection. Based on
the composition of the offline dataset, two main categories of methods are
used: imitation learning, which is suitable for expert datasets, and vanilla
offline RL, which often requires uniform coverage datasets. From a practical
standpoint, datasets often deviate from these two extremes and the exact data
composition is usually unknown a priori. To bridge this gap, we present a new
offline RL framework that smoothly interpolates between the two extremes of
data composition, hence unifying imitation learning and vanilla offline RL. The
new framework is centered around a weak version of the concentrability
coefficient that measures the deviation from the behavior policy to the expert
policy alone.
Under this new framework, we further investigate the question of algorithm
design: can one develop an algorithm that achieves a minimax optimal rate and
also adapts to unknown data composition? To address this question, we consider
a lower confidence bound (LCB) algorithm based on pessimism in the
face of uncertainty in offline RL. We study finite-sample properties of LCB as
well as information-theoretic limits in multi-armed bandits, contextual
bandits, and Markov decision processes (MDPs). Our analysis reveals surprising
facts about optimality rates. In particular, in all three settings, LCB
achieves a faster rate of $1/N$ for nearly-expert datasets compared to the
usual rate of $1/\sqrt{N}$ in offline RL, where $N$ is the number of samples in
the batch dataset. In the case of contextual bandits with at least two
contexts, we prove that LCB is adaptively optimal for the entire data
composition range, achieving a smooth transition from imitation learning to
offline RL. We further show that LCB is almost adaptively optimal in MDPs.
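
As an illustration of pessimism in the simplest of the three settings, the following is a minimal sketch of LCB arm selection for an offline multi-armed bandit. The abstract does not give the penalty term, so this sketch assumes a Hoeffding-style penalty for rewards in [0, 1]. The function name, the `delta` parameter, and the dataset are hypothetical.

```python
import math
from collections import defaultdict


def lcb_select_arm(dataset, delta=0.1):
    """Pick the arm with the highest lower confidence bound.

    dataset: list of (arm, reward) pairs with rewards assumed in [0, 1].
    Assumption: the penalty is sqrt(log(2 * K / delta) / (2 * n_arm)),
    where K is the number of arms seen in the data. Arms that never
    appear in the dataset are not considered.
    """
    counts = defaultdict(int)
    sums = defaultdict(float)
    for arm, reward in dataset:
        counts[arm] += 1
        sums[arm] += reward

    num_arms = len(counts)

    def lcb(arm):
        mean = sums[arm] / counts[arm]
        penalty = math.sqrt(math.log(2 * num_arms / delta) / (2 * counts[arm]))
        return mean - penalty

    return max(counts, key=lcb)


# Hypothetical batch: arm "a" is well covered (mean 0.6 over 100 pulls),
# arm "b" was pulled once and happened to return the maximum reward.
dataset = [("a", 1.0)] * 60 + [("a", 0.0)] * 40 + [("b", 1.0)]
print(lcb_select_arm(dataset))
```

A rule that picked the highest empirical mean would choose arm "b" on the strength of a single sample; the penalty makes the selection favour the arm whose estimate is backed by more data.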

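Proposes a new offline reinforcement learning framework that unifies imitation learning and vanilla offline RL, built around measuring the deviation from the behavior policy to the expert policy. It further studies the question of algorithm design under unknown data composition, analysing LCB, a lower confidence bound algorithm based on pessimism, and its finite-sample properties in multi-armed bandits, contextual bandits, and Markov decision processes. The results reveal some surprising facts about optimality rates.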