Reinforcement learning (RL) combines a control problem with statistical
estimation: The system dynamics are not known to the agent, but can be learned
through experience. A recent line of research casts `RL as inference' and
suggests a particular framework to generalize the RL problem as probabilistic
inference. Our paper surfaces a key shortcoming in that approach, and clarifies
the sense in which RL can be coherently cast as an inference problem. In
particular, an RL agent must consider the effects of its actions upon future
rewards and observations: the exploration-exploitation tradeoff. In all but the
simplest settings, the resulting inference is computationally intractable, so
practical RL algorithms must resort to approximation. We demonstrate that
the popular `RL as inference' approximation can perform poorly even in very
basic problems. However, we show that with a small modification the framework
does yield algorithms that can provably perform well, and we show that the
resulting algorithm is equivalent to the recently proposed K-learning, which we
further connect with Thompson sampling.

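This work clarifies the `RL as inference' approach in light of its
shortcomings. An RL agent must consider the effects of its actions on future
rewards and observations, that is, the tradeoff between exploration and
exploitation. We demonstrate that the `RL as inference' approximation performs
poorly on basic problems, but we show that a small modification to the
framework yields reliable algorithms. The resulting algorithm is equivalent to
the recently proposed K-learning, which we further connect with Thompson
sampling.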