When learning task-oriented dialogue (ToD) agents, reinforcement learning
(RL) techniques can naturally be utilized to train dialogue strategies to
achieve user-specific goals. Prior works mainly focus on adopting advanced RL
techniques to train the ToD agents, while the design of the reward function is
not well studied. This paper aims at answering the question of how to
efficiently learn and leverage a reward function for training end-to-end (E2E)
ToD agents. Specifically, we introduce two generalized objectives for
reward-function learning, inspired by the classical learning-to-rank
literature. Further, we utilize the learned reward function to guide the
training of the E2E ToD agent. With the proposed techniques, we achieve
competitive results on the E2E response-generation task on the Multiwoz 2.0
dataset. Source code and checkpoints are publicly released at
this https URL.

本文介绍了两种常见的奖励函数学习方法，并使用这些方法指导 end-to-end ToD 代理的训练，在 Multiwoz 2.0 数据集上取得了有竞争力的结果。