Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty. In this setting, a Bayes-optimal policy captures the ideal trade-off between exploration and exploitation. Unfortunately, finding Bayes-optimal policies is notoriously taxing due to the enormous search space in the augmented belief-state MDP. In this paper we exploit recent advances in sample-based planning, based on Monte-Carlo tree search, to introduce a tractable method for approximate Bayes-optimal planning. Unlike prior work in this area, we avoid expensive applications of Bayes rule within the search tree, by lazily sampling models from the current beliefs. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems.

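This work proposes a tractable, sample-based method for approximate Bayes-optimal planning built on Monte-Carlo tree search. It avoids expensive applications of Bayes' rule inside the search tree by lazily sampling models from the current beliefs. In experiments on several well-known benchmark problems, the method clearly outperformed earlier Bayesian model-based RL algorithms.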
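
To make the sampling idea concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a toy setting that the text above does not describe: a Bernoulli bandit with independent Beta beliefs over each arm, a short undiscounted horizon, and a UCB1 rule at every history node. The function name root_sampling_search and all of its parameters are hypothetical. The point it illustrates is that one model is drawn from the current belief at the start of each simulation and then held fixed, so no posterior update is ever computed inside the search tree.

```python
import math
import random
from collections import defaultdict


def root_sampling_search(alpha, beta, horizon, n_sims, c=1.0, seed=0):
    """Pick an arm for a Bernoulli bandit with Beta(alpha[a], beta[a]) beliefs.

    Illustrative assumptions: undiscounted finite horizon, UCB1 at every
    history node, and full tree expansion (no separate rollout policy).
    """
    rng = random.Random(seed)
    n_arms = len(alpha)
    visits = defaultdict(int)    # N(history)
    counts = defaultdict(int)    # N(history, action)
    values = defaultdict(float)  # Q(history, action), running mean of returns

    def simulate(history, theta, depth):
        if depth == horizon:
            return 0.0
        # Try every action once at a node before applying UCB1.
        untried = [a for a in range(n_arms) if counts[(history, a)] == 0]
        if untried:
            action = rng.choice(untried)
        else:
            log_n = math.log(visits[history])
            action = max(
                range(n_arms),
                key=lambda a: values[(history, a)]
                + c * math.sqrt(log_n / counts[(history, a)]),
            )
        # The reward comes from the model sampled at the root; the belief
        # is never updated inside the tree.
        reward = 1.0 if rng.random() < theta[action] else 0.0
        next_history = history + ((action, reward),)
        ret = reward + simulate(next_history, theta, depth + 1)
        visits[history] += 1
        counts[(history, action)] += 1
        values[(history, action)] += (ret - values[(history, action)]) / counts[(history, action)]
        return ret

    for _ in range(n_sims):
        # One model per simulation, drawn from the current (root) belief.
        theta = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        simulate((), theta, 0)

    root_q = [values[((), a)] for a in range(n_arms)]
    best = max(range(n_arms), key=lambda a: root_q[a])
    return best, root_q


if __name__ == "__main__":
    # Arm 1 is believed to pay off far more often than arm 0.
    best, root_q = root_sampling_search([1, 9], [9, 1], horizon=3, n_sims=2000)
    print(best, root_q)
```

Because the tree is indexed by the history of actions and rewards, nodes reached under different sampled models share statistics, and the averaging over sampled models stands in for explicit belief updates. A practical planner would likely add a rollout policy and incremental node expansion; both are left out here to keep the sketch short.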

Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search