A popular perspective in reinforcement learning (RL) casts the problem as
probabilistic inference on a graphical model of the Markov decision process
(MDP). The core object of study is the probability of each state-action pair
being visited under the optimal policy. Previous approximations of this
quantity can be arbitrarily poor, leading to algorithms that do not implement
genuine statistical inference and consequently perform poorly on
challenging problems. In this work, we undertake a rigorous Bayesian treatment
of the posterior probability of state-action optimality and clarify how it
flows through the MDP. We first show that this quantity can indeed be used to
generate a policy that explores efficiently, as measured by regret.
Unfortunately, computing it is intractable, so we derive a new variational
Bayesian approximation yielding a tractable convex optimization problem and
establish that the resulting policy also explores efficiently. We call our
approach VAPOR and show that it has strong connections to Thompson sampling,
K-learning, and maximum entropy exploration. We conclude with some experiments
demonstrating the performance advantage of a deep RL version of VAPOR.

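In reinforcement learning, the probability of visiting each state-action pair is studied as a problem of probabilistic inference on a graphical model of the Markov decision process. This work takes a Bayesian approach, rigorously treating the posterior probability of state-action optimality and clarifying how it flows through the Markov decision process. Introducing a variational Bayesian approximation yields a tractable convex optimization problem, and the resulting policy also explores efficiently. The method, called VAPOR, is closely connected to Thompson sampling, K-learning, and maximum entropy exploration. Experiments demonstrate the performance advantage of a deep reinforcement learning version of VAPOR.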