Model-based reinforcement learning (RL) algorithms can attain excellent sample efficiency, but often lag behind the best model-free algorithms in terms of asymptotic performance. This is especially true when they use high-capacity parametric function approximators, such as deep networks. In this paper, we study how to bridge this gap by employing uncertainty-aware dynamics models. We propose a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation. Our comparison to state-of-the-art model-based and model-free deep RL algorithms shows that our approach matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples (e.g., 8 and 125 times fewer samples than Soft Actor Critic and Proximal Policy Optimization, respectively, on the half-cheetah task).
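The combination of a probabilistic ensemble with sampling-based uncertainty propagation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the ensemble members are hand-written linear Gaussian models standing in for trained deep networks, the task is a toy 1-D system with reward `-(state ** 2)`, and a simple random-shooting planner is used to choose actions. The names `make_model`, `expected_return`, and `plan` are hypothetical.

```python
import random

def make_model(gain, noise_std):
    """Hypothetical stand-in for one trained probabilistic network:
    returns the mean and std of a Gaussian over the next state."""
    def predict(state, action):
        return state + gain * action, noise_std
    return predict

def expected_return(models, state0, actions, n_particles, rng):
    """Trajectory sampling: each particle is tied to one ensemble member
    and samples every next state from that member's predictive Gaussian."""
    total = 0.0
    for p in range(n_particles):
        model = models[p % len(models)]  # fixed model per particle
        state = state0
        ret = 0.0
        for a in actions:
            mean, std = model(state, a)
            state = rng.gauss(mean, std)  # propagate a sample, not the mean
            ret += -(state ** 2)          # toy reward: stay near the origin
        total += ret
    return total / n_particles

def plan(models, state0, horizon=5, n_candidates=200, n_particles=20, seed=0):
    """Random-shooting planner (an illustrative simplification):
    score random action sequences and return the best first action."""
    rng = random.Random(seed)
    best_seq, best_val = None, float("-inf")
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        val = expected_return(models, state0, seq, n_particles, rng)
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq[0], best_val

# Three ensemble members that disagree about the action gain.
ensemble = [make_model(g, 0.05) for g in (0.9, 1.0, 1.1)]
first_action, value = plan(ensemble, state0=2.0)
```

In this sketch, disagreement between ensemble members plays the role of uncertainty about the dynamics, while each member's predictive standard deviation plays the role of noise in the system itself. Averaging returns over particles accounts for both when ranking action sequences.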

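This work aims to improve the sample efficiency of reinforcement learning algorithms by using uncertainty-aware deep network dynamics models, with uncertainty handled through sampling-based propagation, and thereby to address the performance gap seen with parametric function approximators such as deep networks. We propose a new algorithm called PETS. A comparison with state-of-the-art deep reinforcement learning algorithms shows that our method matches the asymptotic performance of model-free algorithms while requiring significantly fewer samples on many challenging benchmark tasks (for example, 8 and 125 times fewer samples than Soft Actor Critic and Proximal Policy Optimization, respectively, on the half-cheetah task).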

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models