Reinforcement learning (RL) combines a control problem with statistical estimation: the system dynamics are not known to the agent, but can be learned through experience. A recent line of research casts `RL as inference' and suggests a particular framework to generalize the RL problem as probabilistic inference. Our paper surfaces a key shortcoming in that approach, and clarifies the sense in which RL can be coherently cast as an inference problem. In particular, an RL agent must consider the effects of its actions upon future rewards and observations: the exploration-exploitation tradeoff. In all but the simplest settings, the resulting inference is computationally intractable, so practical RL algorithms must resort to approximation. We demonstrate that the popular `RL as inference' approximation can perform poorly even in very basic problems. However, we show that with a small modification the framework does yield algorithms that can provably perform well, and we show that the resulting algorithm is equivalent to the recently proposed K-learning, which we further connect with Thompson sampling.

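This work identifies a shortcoming of the `RL as inference' approach and clarifies the sense in which RL can be cast as inference. An RL agent must consider the effects of its actions on future rewards and observations, that is, the tradeoff between exploration and exploitation. We demonstrate that the `RL as inference' approximation can perform poorly even in basic problems, but we show that a small modification to the framework yields reliable algorithms. The resulting algorithm is equivalent to the recently proposed K-learning, which we further connect with Thompson sampling.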
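
To make the exploration-exploitation tradeoff and the reference to Thompson sampling concrete, the following is a minimal sketch of Beta-Bernoulli Thompson sampling on a multi-armed bandit. It is an illustration only: it is not K-learning and not the algorithm analyzed in the paper, and the function name `thompson_sampling`, the three arm means, and the step count are hypothetical choices made for this example.

```python
import random


def thompson_sampling(true_means, num_steps, seed=0):
    """Run Beta-Bernoulli Thompson sampling on a Bernoulli bandit.

    true_means is a hypothetical list of per-arm success probabilities,
    unknown to the agent. Returns (pull counts per arm, total reward).
    """
    rng = random.Random(seed)
    num_arms = len(true_means)
    # Beta(1, 1) (uniform) prior on each arm's success probability.
    successes = [1] * num_arms
    failures = [1] * num_arms
    pulls = [0] * num_arms
    total_reward = 0
    for _ in range(num_steps):
        # Draw one plausible mean per arm from its current posterior.
        samples = [
            rng.betavariate(successes[a], failures[a]) for a in range(num_arms)
        ]
        # Act greedily with respect to the sampled means; posterior
        # uncertainty is what drives exploration here.
        arm = max(range(num_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate posterior update for the pulled arm.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        total_reward += reward
    return pulls, total_reward


# Hypothetical three-armed bandit; the last arm is the best one.
pulls, total_reward = thompson_sampling([0.2, 0.5, 0.8], num_steps=2000)
print(pulls, total_reward)
```

In this sketch the agent keeps sampling arms whose posterior is still uncertain, so pulls should concentrate on the best arm as evidence accumulates. This is the behavior that an approach ignoring the effect of actions on future observations may fail to reproduce.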

Understanding Reinforcement Learning and Probabilistic Inference