We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation (UBE), but the over-approximation may result in inefficient exploration. We propose a new UBE whose solution converges to the true posterior variance over values and leads to lower regret in tabular exploration problems. We identify challenges to apply the UBE theory beyond tabular problems and propose a suitable approximation. Based on this approximation, we introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied for either risk-seeking or risk-averse policy optimization with minimal changes. Experiments in both online and offline RL demonstrate improved performance compared to other uncertainty estimation methods.

基于模型的强化学习中，我们考虑量化预期累积奖励的不确定性问题。我们提出了一个新的不确定性 Bellman 方程，其收敛到真实后验价值方差并在表格型探索问题中降低遗憾。我们鉴定了超越表格问题的应用挑战，并提出了相应的近似方法。基于这个近似，我们引入了一种通用的策略优化算法，Q-不确定性软 Actor-Critic（QU-SAC），可在风险追求或风险规避的策略优化中进行最小程度改动。在线与离线强化学习的实验结果表明相较于其他不确定性估计方法，性能得到了提升。

基于模型的风险意识策略优化的认知变异性