We present a new model-based algorithm for reinforcement learning (RL) which consists of explicit exploration and exploitation phases, and is applicable in large or infinite state spaces. The algorithm maintains a set of dynamics models consistent with current experience and explores by finding policies which induce high disagreement between their state predictions. It then exploits using the refined set of models or experience gathered during exploration. We show that under realizability and optimal planning assumptions, our algorithm provably finds a near-optimal policy with a number of samples that is polynomial in a structural complexity measure which we show to be low in several natural settings. We then give a practical approximation using neural networks and demonstrate its performance and sample efficiency in practice.

提出了一种基于模型的强化学习算法，该算法包括明确的探索和利用阶段，并适用于大规模或无限状态空间，该算法维护一组与当前体验一致的动态模型，并通过查找在状态预测之间引起高度分歧的策略来进行探索，然后利用精细化的模型或在探索过程中收集的体验，我们证明，在实现和最优规划的假设下，我们的算法能够用多项式结构复杂度度量在很多自然设置中得到完美的政策，并给出了一个使用神经网络的实用近似，并证明了它在实践中的性能和样本效率。

连续状态空间中的显式探索-利用算法