An inherent problem in reinforcement learning is coping with policies that are uncertain about what action to take (or the value of a state). Model uncertainty, more formally known as epistemic uncertainty, refers to the expected prediction error of a model beyond the sampling noise. In this paper, we propose a metric for epistemic uncertainty estimation in Q-value functions, which we term pathwise epistemic uncertainty. We further develop a method to compute its approximate upper bound, which we call the F-value. We experimentally apply the latter to Deep Q-Networks (DQN) and show that uncertainty estimation in reinforcement learning serves as a useful indication of learning progress. We then propose a new approach to improving sample efficiency in actor-critic algorithms by learning from an existing (previously learned or hard-coded) oracle policy while uncertainty is high, aiming to avoid unproductive random actions during training. We term this Critic Confidence Guided Exploration (CCGE). We implement CCGE on Soft Actor-Critic (SAC) using our F-value metric, apply it to a handful of popular Gym environments, and show that it achieves better sample efficiency and total episodic reward than vanilla SAC in limited contexts.
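The idea of deferring to an oracle while uncertainty is high can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a scalar uncertainty estimate for a state-action pair (standing in for the F-value) and a fixed threshold, and the names `select_action`, `policy_action`, `oracle_action`, `uncertainty`, and `threshold` are hypothetical. The actual CCGE switching rule may differ.

```python
from typing import Any, Callable, Tuple

Action = int  # assumption: discrete actions, for illustration only


def select_action(
    state: Any,
    policy_action: Callable[[Any], Action],       # hypothetical learned policy
    oracle_action: Callable[[Any], Action],       # hypothetical oracle policy
    uncertainty: Callable[[Any, Action], float],  # hypothetical stand-in for an F-value estimate
    threshold: float,                             # assumed fixed cutoff, not taken from the paper
) -> Tuple[Action, bool]:
    """Return (action, used_oracle).

    The learner proposes an action; if the critic's uncertainty about that
    action exceeds the threshold, the oracle's action is used instead.
    """
    proposed = policy_action(state)
    if uncertainty(state, proposed) > threshold:
        return oracle_action(state), True
    return proposed, False
```

Under these assumptions, the learner acts on its own once its uncertainty estimate falls below the threshold, so oracle guidance would be expected to fade as training progresses.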

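This paper proposes and applies a metric for epistemic uncertainty in Q-value functions, called pathwise epistemic uncertainty, and develops the F-value, a method for computing its approximate upper bound. We apply it experimentally to Deep Q-Networks (DQN) to show that uncertainty estimation in reinforcement learning is a useful indicator of learning progress, and we propose a new method, Critic Confidence Guided Exploration (CCGE), which learns from an existing (previously learned or hard-coded) oracle policy while uncertainty is high in order to avoid unproductive random actions during training. We then apply this method to Soft Actor-Critic (SAC) and show on several common Gym environments that it performs better than vanilla SAC.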

Some Supervision Required: Incorporating Oracle Policies in Reinforcement Learning via Epistemic Uncertainty Metrics