While deep reinforcement learning (RL) agents have showcased strong results
across many domains, a major concern is their inherent opaqueness and the
safety of such systems in real-world use cases. To overcome these issues, we
need agents that can quantify their uncertainty and detect out-of-distribution
(OOD) states. Existing uncertainty estimation techniques, like Monte-Carlo
Dropout or Deep Ensembles, have not seen widespread adoption in on-policy deep
RL. We posit that this is due to two reasons. First, concepts like uncertainty
and OOD states are not as well defined as in supervised learning, especially
for on-policy RL methods. Second, available implementations and comparative
studies for uncertainty estimation methods in RL have been limited. To overcome
the first gap, we propose definitions of uncertainty and OOD for Actor-Critic
RL algorithms, specifically proximal policy optimization (PPO), and present possible
applicable measures. In particular, we discuss the concepts of value and policy
uncertainty. The second point is addressed by implementing different
uncertainty estimation methods and comparing them across a number of
environments. The OOD detection performance is evaluated via a custom
evaluation benchmark of in-distribution (ID) and OOD states for various RL
environments. We identify a trade-off between reward and OOD detection
performance. To overcome this, we formulate a Pareto optimization problem in
which we simultaneously optimize for reward and OOD detection performance. We
show experimentally that the recently proposed Masksembles method strikes a
favourable balance among the surveyed methods, enabling high-quality uncertainty
estimation and OOD detection while matching the performance of the original RL
agents.
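
The abstract distinguishes value uncertainty from policy uncertainty but does not state which measures are used. As a minimal sketch, assume an agent that produces K value predictions and K discrete action distributions for the same state (for example from ensemble members or masked sub-models). The function names below are hypothetical, and the two measures (standard deviation of the values, mutual information between the policies) are common choices assumed here for illustration, not necessarily the ones the paper adopts.

```python
import math
from statistics import pstdev


def value_uncertainty(values):
    """Disagreement of K critic outputs: population standard deviation."""
    return pstdev(values)


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def policy_uncertainty(action_dists):
    """Mutual information between K discrete policies for one state.

    Computed as the entropy of the averaged policy minus the average
    entropy of the individual policies. It is zero when all K policies
    agree and grows as they disagree.
    """
    k = len(action_dists)
    n_actions = len(action_dists[0])
    mean_dist = [sum(d[a] for d in action_dists) / k for a in range(n_actions)]
    mean_entropy = sum(entropy(d) for d in action_dists) / k
    return entropy(mean_dist) - mean_entropy


if __name__ == "__main__":
    # Illustrative inputs only, not results from the paper.
    print(value_uncertainty([0.9, 1.1, 1.0]))
    print(policy_uncertainty([[0.9, 0.1], [0.2, 0.8]]))
```

Under these assumptions, a state would be flagged as OOD when either score exceeds a threshold chosen on in-distribution states.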

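This study proposes definitions of uncertainty and OOD states for Actor-Critic RL algorithms, applies several uncertainty estimation methods and compares their OOD detection performance, and formulates a Pareto optimization problem, showing that Masksembles successfully balances reward and OOD detection performance.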
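
The trade-off between reward and OOD detection performance can be illustrated with a small Pareto-front filter. This is a sketch under stated assumptions: each candidate configuration is summarized by a hypothetical `(name, reward, ood_score)` tuple, both quantities are maximized, and the numbers are placeholders rather than results from the paper. How the paper itself solves the Pareto problem is not specified in the abstract.

```python
def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    A candidate is dominated if another one is at least as good on both
    reward and OOD score and strictly better on at least one of them.
    """
    front = []
    for name, reward, ood in candidates:
        dominated = any(
            (r >= reward and o >= ood) and (r > reward or o > ood)
            for _, r, o in candidates
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Placeholder values for illustration only.
    configs = [
        ("config_a", 1.0, 0.5),
        ("config_b", 0.8, 0.9),
        ("config_c", 0.7, 0.4),
        ("config_d", 0.9, 0.9),
    ]
    print(pareto_front(configs))
```

Only the configurations on the front are worth comparing further; choosing among them depends on how much reward one is willing to give up for better OOD detection.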

How to Enable Uncertainty Estimation in Proximal Policy Optimization

Deep neural networks have amply demonstrated their prowess, but estimating the
reliability of their predictions remains challenging. Deep Ensembles are widely
considered one of the best methods for generating uncertainty
estimates but are very expensive to train and evaluate. MC-Dropout is a
popular alternative that is less expensive but also less reliable. Our
central intuition is that there is a continuous spectrum of ensemble-like
models of which MC-Dropout and Deep Ensembles are extreme examples. The first
uses an effectively infinite number of highly correlated models while the
second relies on a finite number of independent models.
To combine the benefits of both, we introduce Masksembles. Instead of
randomly dropping parts of the network as in MC-Dropout, Masksembles relies on a
fixed number of binary masks, which are parameterized in a way that allows the
correlations between individual models to be changed. Specifically, by controlling
the overlap between the masks and their density, one can choose the optimal
configuration for the task at hand. This leads to a simple, easy-to-implement
method with performance on par with Deep Ensembles at a fraction of the
cost. We experimentally validate Masksembles on two widely used datasets,
CIFAR10 and ImageNet.
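
The idea of a fixed set of binary masks with controllable density and overlap can be sketched as follows. This is not the paper's mask-generation procedure, which the abstract does not describe; seeded random sampling is used here as a stand-in, and the names `make_masks`, `overlap`, and `apply_mask` are hypothetical.

```python
import random


def make_masks(num_masks, width, density, seed=0):
    """Create a fixed set of binary masks, each keeping `density` of the units.

    Unlike MC-Dropout, the masks are sampled once and then reused for both
    training and inference, so each mask defines one sub-model.
    """
    rng = random.Random(seed)
    ones = round(width * density)
    masks = []
    for _ in range(num_masks):
        kept = set(rng.sample(range(width), ones))
        masks.append([1 if i in kept else 0 for i in range(width)])
    return masks


def overlap(a, b):
    """Intersection over union of the units kept by two masks."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 1.0


def apply_mask(features, mask):
    """Zero out the features that the mask drops."""
    return [f * m for f, m in zip(features, mask)]


if __name__ == "__main__":
    masks = make_masks(num_masks=4, width=16, density=0.5)
    print(overlap(masks[0], masks[1]))
    print(apply_mask([1.0] * 16, masks[0]))
```

In this sketch, a density of 1.0 makes every mask identical, so the sub-models are fully correlated, while lower densities reduce the expected overlap and push the sub-models toward the independence of an ensemble.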

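This paper introduces Masksembles, a new deep learning method that combines ideas from Deep Ensembles and MC-Dropout: it uses a fixed number of binary masks to control the correlation between models, achieving performance on par with Ensembles at a small cost.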