Despite remarkable successes, deep reinforcement learning algorithms remain sample inefficient: they require an enormous amount of trial and error to find good policies. Model-based algorithms promise sample efficiency by building an environment model that can be used for planning. Posterior Sampling for Reinforcement Learning is such a model-based algorithm that has attracted significant interest due to its performance in the tabular setting. This paper introduces Posterior Sampling for Deep Reinforcement Learning (PSDRL), the first truly scalable approximation of Posterior Sampling for Reinforcement Learning that retains its model-based essence. PSDRL combines efficient uncertainty quantification over latent state space models with a specially tailored continual planning algorithm based on value-function approximation. Extensive experiments on the Atari benchmark show that PSDRL significantly outperforms previous state-of-the-art attempts at scaling up posterior sampling while being competitive with a state-of-the-art (model-based) reinforcement learning method, both in sample efficiency and computational efficiency.
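For readers unfamiliar with the tabular algorithm that PSDRL approximates, the following is a minimal sketch of Posterior Sampling for Reinforcement Learning in a small finite MDP. It is not the PSDRL method itself. It assumes known rewards, a Dirichlet prior over transition probabilities, fixed-length episodes that start in state 0, and exact planning by backward induction. The function names `solve_finite_horizon` and `psrl` are hypothetical and chosen only for illustration.

```python
import numpy as np


def solve_finite_horizon(P, R, horizon):
    """Plan by backward induction.

    P: (S, A, S) transition probabilities, R: (S, A) rewards.
    Returns a (horizon, S) array of greedy actions per time step.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in reversed(range(horizon)):
        Q = R + P @ V  # (S, A): immediate reward plus expected next value
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy


def psrl(true_P, R, horizon, n_episodes, seed=0):
    """Tabular posterior sampling with a Dirichlet(1, ..., 1) prior.

    Assumption: rewards R are known; only transitions are learned.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = true_P.shape
    counts = np.ones((n_states, n_actions, n_states))  # Dirichlet parameters
    returns = []
    for _ in range(n_episodes):
        # 1. Sample one plausible model from the current posterior.
        sampled_P = np.array(
            [[rng.dirichlet(counts[s, a]) for a in range(n_actions)]
             for s in range(n_states)]
        )
        # 2. Plan as if the sampled model were the true environment.
        policy = solve_finite_horizon(sampled_P, R, horizon)
        # 3. Act in the real environment and update the posterior.
        s, total = 0, 0.0
        for t in range(horizon):
            a = policy[t, s]
            s_next = rng.choice(n_states, p=true_P[s, a])
            total += R[s, a]
            counts[s, a, s_next] += 1
            s = s_next
        returns.append(total)
    return counts, np.array(returns)
```

The sample-plan-act loop is the model-based essence the abstract refers to. Exact planning and count-based posteriors do not scale to image observations, which is the gap that latent state space models and value-function approximation are meant to address.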

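This paper introduces PSDRL, the first truly scalable approximation of Posterior Sampling for Reinforcement Learning. It combines a continual planning algorithm based on value-function approximation with efficient uncertainty quantification over latent state space models. Extensive experiments on the Atari benchmark show that PSDRL significantly outperforms previous attempts in both sample efficiency and computational efficiency, and that it is competitive with a model-based reinforcement learning method.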

Posterior Sampling for Deep Reinforcement Learning