Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties for {\em learning} the model, they ignore this distinction when {\em optimizing} the policy. In this paper, we show that ignoring the epistemic uncertainty leads to greedy algorithms that do not explore sufficiently. In turn, we propose a {\em practical optimistic-exploration algorithm} (\alg), which enlarges the input space with {\em hallucinated} inputs that can exert as much control as the {\em epistemic} uncertainty in the model affords. We analyze this setting and construct a general regret bound for well-calibrated models, which is provably sublinear in the case of Gaussian Process models. Based on this theoretical foundation, we show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and different probabilistic models. Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.
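To make the idea of hallucinated inputs concrete, the following is a minimal sketch of a single optimistic transition, not the authors' implementation. It assumes the model exposes a predictive mean and an epistemic standard deviation, and that a hallucinated input with entries in [-1, 1] selects the next state inside the resulting confidence region. The names `hallucinated_step`, `toy_mean`, `toy_epistemic_std`, and the scaling factor `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np


def hallucinated_step(mean_fn, epistemic_std_fn, state, action, eta, beta=1.0):
    """Return an optimistic next state under a hallucinated input eta.

    Assumption: the plausible next states form the box
    mean +/- beta * epistemic_std, and eta in [-1, 1]^d picks a point in it.
    Aleatoric noise is deliberately left out of this sketch.
    """
    eta = np.clip(eta, -1.0, 1.0)            # hallucinated input is bounded
    mu = mean_fn(state, action)              # model's mean prediction
    sigma = epistemic_std_fn(state, action)  # model's epistemic uncertainty
    return mu + beta * sigma * eta


# Hypothetical toy model, used only to exercise the function.
def toy_mean(state, action):
    return state + action


def toy_epistemic_std(state, action):
    return np.full_like(state, 0.5)


state = np.array([0.0, 1.0])
action = np.array([1.0, -1.0])

# eta = 0 recovers the mean prediction, i.e. the greedy transition.
greedy_next = hallucinated_step(toy_mean, toy_epistemic_std, state, action,
                                eta=np.zeros(2))

# eta = 1 pushes the prediction to the upper edge of the confidence box.
optimistic_next = hallucinated_step(toy_mean, toy_epistemic_std, state, action,
                                    eta=np.ones(2), beta=2.0)
```

In this sketch the policy optimizer would treat `eta` as an extra control input alongside `action`, so the influence of the hallucinated input shrinks as the epistemic uncertainty shrinks.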

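This paper proposes a model-based reinforcement learning algorithm (H-UCRL) that enlarges the input space with hallucinated inputs and uses the epistemic uncertainty directly to improve exploration, so that the distinction between epistemic and aleatoric uncertainty is also respected when optimizing the policy. The paper analyzes H-UCRL and constructs a general regret bound for well-calibrated models, which is provably sublinear for Gaussian Process models, and it shows that optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms and with different probabilistic models. Experiments show that the proposed algorithm significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.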

Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning