Dialogue policy optimization in task-oriented dialogue systems often receives
feedback only upon task completion. This is insufficient for training intermediate
dialogue turns since supervision signals (or rewards) are only provided at the
end of dialogues. To address this issue, reward learning has been introduced to
learn from state-action pairs of an optimal policy to provide turn-by-turn
rewards. This approach requires complete state-action annotations of
human-to-human dialogues (i.e., expert demonstrations), which are
labor-intensive to produce. To overcome this limitation, we propose a novel reward learning
approach for semi-supervised policy learning. The proposed approach learns a
dynamics model as the reward function, which models dialogue progress (i.e.,
state-action sequences) based on expert demonstrations, either with or without
annotations. The dynamics model computes rewards by predicting whether the
dialogue progress is consistent with expert demonstrations. We further propose
to learn action embeddings for better generalization of the reward function.
The proposed approach outperforms competitive policy learning baselines on
MultiWOZ, a benchmark multi-domain dataset.
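
The idea of rewarding turns whose dialogue progress is consistent with expert demonstrations can be illustrated with a minimal sketch. This is not the proposed model: it substitutes a simple count-based transition table for the learned dynamics model, and the state and action names, the function names `fit_dynamics` and `turn_reward`, and the add-one smoothing are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def fit_dynamics(demos):
    """Count (state, action) -> next_state transitions in expert dialogues."""
    counts = defaultdict(Counter)
    for dialogue in demos:
        for state, action, next_state in dialogue:
            counts[(state, action)][next_state] += 1
    return counts


def turn_reward(counts, state, action, next_state, n_states, smoothing=1.0):
    """Smoothed log-probability of a transition under the expert dynamics.

    Transitions seen often in the demonstrations score higher than
    transitions that never occur there.
    """
    seen = counts.get((state, action), Counter())
    total = sum(seen.values())
    prob = (seen[next_state] + smoothing) / (total + smoothing * n_states)
    return math.log(prob)


# Hypothetical toy demonstrations: each turn is (state, action, next_state).
demos = [
    [("greet", "ask_area", "area_known"), ("area_known", "ask_price", "price_known")],
    [("greet", "ask_area", "area_known")],
]
counts = fit_dynamics(demos)

# A turn that follows the expert demonstrations.
expert_like = turn_reward(counts, "greet", "ask_area", "area_known", n_states=3)
# A turn whose (state, action) pair never appears in the demonstrations.
off_policy = turn_reward(counts, "greet", "ask_price", "area_known", n_states=3)
```

In this toy setting the expert-like turn receives the higher reward, giving the policy a signal at every turn rather than only at the end of the dialogue. A count table cannot generalize to unseen actions, which is the gap that learned action embeddings are meant to address.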

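This paper proposes a novel reward learning method for semi-supervised policy learning in task-oriented dialogue systems. The method uses a dynamics model to compute rewards and learns action embeddings to help the reward function generalize, and it outperforms competitive policy learning baselines.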