While reinforcement learning algorithms provide automated acquisition of optimal policies, practical application of such methods requires a number of design decisions, such as manually designing reward functions that not only define the task, but also provide sufficient shaping to accomplish it. In this paper, we discuss a new perspective on reinforcement learning, recasting it as the problem of inferring actions that achieve desired outcomes, rather than a problem of maximizing rewards. To solve the resulting outcome-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function which can be learned directly from environment interactions. From the corresponding variational objective, we also derive a new probabilistic Bellman backup operator reminiscent of the standard Bellman backup operator and use it to develop an off-policy algorithm to solve goal-directed tasks. We empirically demonstrate that this method eliminates the need to design reward functions and leads to effective goal-directed behaviors.

通过提出一种新的变分推断形式，从环境交互中直接学习良好的奖励函数，并使用新的概率贝尔曼反演运算符，发展了一种离线策略算法来解决目标导向任务，该方法消除了手工制作奖励函数的需要，并对各种机械操纵和运动任务产生了有效的目标导向行为。

通过变分推断实现基于结果的强化学习