A popular perspective in Reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP). The core object of study is the probability of each state-action pair being visited under the optimal policy. Previous approaches to approximate this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference and consequently do not perform well in challenging problems. In this work, we undertake a rigorous Bayesian treatment of the posterior probability of state-action optimality and clarify how it flows through the MDP. We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret. Unfortunately, computing it is intractable, so we derive a new variational Bayesian approximation yielding a tractable convex optimization problem and establish that the resulting policy also explores efficiently. We call our approach VAPOR and show that it has strong connections to Thompson sampling, K-learning, and maximum entropy exploration. We conclude with some experiments demonstrating the performance advantage of a deep RL version of VAPOR.

强化学习中，通过马尔科夫决策过程的图形模型，以概率推理的方式对各状态-行为对的访问概率进行研究。本研究采用贝叶斯方法，严格处理了状态-行为优化的后验概率，并阐明了其在马尔科夫决策过程中的流动方式。通过引入变分贝叶斯近似方法，得到了一个可行的凸优化问题，建立的策略也能有效地进行探索。该方法称为VAPOR，与汤普森抽样、K学习和最大熵探索有着紧密的联系。通过一些实验，展示了深度强化学习版本VAPOR在性能上的优势。

强化学习中的概率推理正确实施