Model-based reinforcement learning algorithms tend to achieve higher sample efficiency than model-free methods. However, because learned models are inevitably inaccurate, model-based methods struggle to match the asymptotic performance of model-free methods. In this paper, we propose Policy Optimization with Model-Based Uncertainty (POMBU), a novel model-based approach that effectively improves asymptotic performance by using the uncertainty in Q-values. We derive an upper bound on this uncertainty, from which the uncertainty can be approximated accurately and efficiently in model-based methods. We further propose an uncertainty-aware policy optimization algorithm that optimizes the policy conservatively, encouraging performance improvement with high probability. This significantly alleviates overfitting of the policy to inaccurate models. Experiments show that POMBU outperforms existing state-of-the-art policy optimization algorithms in both sample efficiency and asymptotic performance. Moreover, the experiments demonstrate that POMBU is considerably more robust than previous model-based approaches.
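To make the idea of optimizing conservatively under Q-value uncertainty concrete, the following is a minimal sketch and not the POMBU algorithm itself. It assumes that uncertainty is approximated by the disagreement (standard deviation) among Q-value estimates from an ensemble of learned models, and that a candidate action is scored by its mean estimate minus a penalty on that disagreement. The function names, the `penalty` parameter, and the numbers are hypothetical and chosen only for illustration; the abstract does not specify the paper's upper bound or objective.

```python
import statistics


def q_uncertainty(q_estimates):
    """Assumed uncertainty proxy: population standard deviation of the
    Q-value estimates produced by an ensemble of learned models."""
    return statistics.pstdev(q_estimates)


def conservative_value(q_estimates, penalty=1.0):
    """Mean Q-value estimate minus a penalty proportional to ensemble
    disagreement (a hypothetical conservative score, not the paper's)."""
    mean_q = statistics.fmean(q_estimates)
    return mean_q - penalty * q_uncertainty(q_estimates)


# Hypothetical Q-value estimates for two actions from a 4-model ensemble.
agree = [1.0, 1.0, 1.0, 1.0]     # lower mean, models agree
disagree = [4.0, 0.0, 4.0, 0.0]  # higher mean, models disagree

# The plain mean prefers the second action; the penalized score prefers the first.
prefers_by_mean = statistics.fmean(disagree) > statistics.fmean(agree)
prefers_conservative = conservative_value(agree) > conservative_value(disagree)
```

In this toy setting, the action whose value the models disagree about looks better on average, but the penalized score favors the action with consistent estimates, which is the intuition behind avoiding overfitting of the policy to an inaccurate model.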

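This paper proposes POMBU, a new policy optimization method based on model uncertainty. By exploiting the uncertainty in Q-values, it improves both asymptotic performance and sample efficiency, and it achieves robustness through a conservative optimization algorithm. Experiments show that POMBU outperforms existing state-of-the-art algorithms in sample efficiency and asymptotic performance, and that it is highly robust compared with previous model-based methods.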

Deep Model-Based Reinforcement Learning via Estimating Uncertainty and Conservative Policy Optimization