In the standard Markov decision process formalism, users specify tasks by writing down a reward function. However, in many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved. Motivated by this observation, we derive a control algorithm from first principles that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states. Prior work has approached similar problem settings in a two-stage process, first learning an auxiliary reward function and then optimizing this reward function using another reinforcement learning algorithm. In contrast, we derive a method based on recursive classification that eschews auxiliary reward functions and instead directly learns a value function from transitions and successful outcomes. Our method therefore requires fewer hyperparameters to tune and fewer lines of code to debug. We show that our method satisfies a new data-driven Bellman equation, where examples take the place of the typical reward function term. Experiments show that our approach outperforms prior methods that learn explicit reward functions.
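To make the idea of a Bellman-style recursion in which success examples stand in for the reward term more concrete, the following is a minimal tabular sketch. It is an illustration only, not the recursive-classification method described above: it assumes a known transition matrix `P` for a fixed policy, a hypothetical 4-state chain, and a hand-picked set of success states, whereas the method in the abstract learns from sampled transitions and example states. The function name `success_value` and all numbers are assumptions made for this example.

```python
import numpy as np


def success_value(P, success_states, gamma=0.9, iters=500):
    """Iterate V(s) = (1 - gamma) * 1[s is a success example] + gamma * sum_s' P[s, s'] * V(s').

    V(s) can be read as a discounted probability of being in a success state
    in the future when starting from s. No reward function is written down;
    the indicator over example states plays that role.
    """
    n = P.shape[0]
    is_success = np.zeros(n)
    is_success[list(success_states)] = 1.0
    V = np.zeros(n)
    for _ in range(iters):
        V = (1.0 - gamma) * is_success + gamma * (P @ V)
    return V


# Hypothetical 4-state chain (an assumption for illustration): each state
# moves one step right with probability 0.8 and stays put with probability
# 0.2; the last state is absorbing.
P = np.array([
    [0.2, 0.8, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 0.0, 0.2, 0.8],
    [0.0, 0.0, 0.0, 1.0],
])

# The only task specification is the set of example success states.
V = success_value(P, success_states=[3])
print(V)
```

In this toy setting, states closer to the example outcome receive higher values, which is the quantity a policy would try to increase. The sketch sidesteps the harder part addressed by the method above, namely estimating such values from data without access to `P`.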

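This paper presents a reinforcement learning algorithm that makes it easier for users to specify tasks: instead of writing a complex reward function that requires technical expertise, the user provides examples of successful outcomes. The method does not learn an intermediate reward function. It learns a value function from transitions and successful outcomes alone, so it has fewer hyperparameters to tune and its code is simpler and easier to read. Experimental results show that the method outperforms prior methods that learn explicit reward functions.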

Example-based policy search via recursive classification as a replacement for rewards