We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the estimated value of any fixed policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $\epsilon$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
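
The recursion described above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes a tabular setting with a fixed policy, a known transition matrix `transition` over states under that policy, a per-state local uncertainty vector `local_uncertainty`, and a discounted form of the equation in which uncertainty is propagated with factor `discount ** 2`. All of these names and the discounted form are assumptions made for illustration. The sketch iterates the linear map to its fixed point and compares the result with a direct linear solve.

```python
import numpy as np


def solve_ube(local_uncertainty, transition, discount=0.9, tol=1e-10, max_iters=10_000):
    """Iterate u <- nu + discount**2 * P @ u until the update is below tol.

    Hypothetical helper: `local_uncertainty` (nu) is the per-state local
    uncertainty and `transition` (P) is the row-stochastic transition
    matrix under a fixed policy. The discount**2 factor is an assumption
    for a discounted setting; it makes the map a contraction.
    """
    nu = np.asarray(local_uncertainty, dtype=float)
    P = np.asarray(transition, dtype=float)
    beta = discount ** 2
    u = np.zeros_like(nu)
    for _ in range(max_iters):
        u_next = nu + beta * (P @ u)
        if np.max(np.abs(u_next - u)) < tol:
            return u_next
        u = u_next
    return u


# Toy three-state chain: 0 -> 1 -> 2, and state 2 loops on itself.
P = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])
# Only state 2 has local uncertainty.
nu = np.array([0.0, 0.0, 1.0])

u = solve_ube(nu, P, discount=0.9)

# The fixed point of a linear contraction can also be obtained directly.
u_closed = np.linalg.solve(np.eye(3) - 0.9 ** 2 * P, nu)
```

In this toy chain, states 0 and 1 have no local uncertainty of their own, yet their fixed-point uncertainty is positive because it is inherited from state 2. This is the sense in which the equation extends the exploratory benefit beyond a single time-step.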

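In this paper we consider the exploration/exploitation problem in reinforcement learning and propose the uncertainty Bellman equation (UBE) to extend the potential exploratory benefit of a policy. We prove that the unique fixed point of this equation yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. Compared with traditional count-based bonuses, this bound controls variance rather than standard deviation. Substituting the UBE exploration strategy for $\epsilon$-greedy improves DQN performance on Atari games.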

The Uncertainty Bellman Equation and Exploration