Multi-action dialog policy, which generates multiple atomic dialog actions
per turn, has been widely applied in task-oriented dialog systems to provide
expressive and efficient system responses. Existing policy models usually
imitate action combinations from the labeled multi-action dialog examples. Due
to data limitations, they generalize poorly to unseen dialog flows. While
reinforcement learning-based methods have been proposed to incorporate the service
ratings from real users and user simulators as external supervision signals,
they suffer from sparse and less credible dialog-level rewards. To cope with
this problem, we explore improving multi-action dialog policy learning with
explicit and implicit turn-level user feedback received for historical
predictions (i.e., logged user feedback), which is cost-efficient to collect and
faithful to real-world scenarios. The task is challenging since the logged user
feedback provides only partial label feedback limited to the particular
historical dialog actions predicted by the agent. To fully exploit such
feedback information, we propose BanditMatch, which addresses the task from a
feedback-enhanced semi-supervised learning perspective with a hybrid objective
of semi-supervised learning and bandit learning. BanditMatch integrates
pseudo-labeling methods to better explore the action space by constructing
full label feedback. Extensive experiments show that BanditMatch
outperforms state-of-the-art methods by generating more concise and
informative responses. The source code and the appendix of this paper can be
obtained from this https URL.
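
To make the idea of constructing full label feedback from partial logged feedback more concrete, the following is a minimal sketch and not the BanditMatch implementation. It assumes a multi-label setting in which the model outputs one probability per atomic dialog action, logged feedback covers only the actions the agent actually predicted, and confident predictions on the remaining actions are turned into pseudo-labels. The function names (`build_full_labels`, `masked_bce`), the confidence threshold of 0.9, and the use of a masked binary cross-entropy in place of the paper's hybrid objective are all assumptions made for illustration.

```python
import math


def build_full_labels(probs, logged_actions, feedback, threshold=0.9):
    """Merge logged user feedback with pseudo-labels (illustrative only).

    probs          -- model probability for each atomic dialog action
    logged_actions -- indices of actions the agent predicted in the log
    feedback       -- dict: action index -> 1 (accepted) or 0 (rejected)
    threshold      -- hypothetical confidence cutoff for pseudo-labels
    Returns (labels, mask); mask is 0.0 where no label could be assigned.
    """
    labels, mask = [], []
    for i, p in enumerate(probs):
        if i in logged_actions:
            # Partial label feedback: only logged actions have a true signal.
            labels.append(float(feedback[i]))
            mask.append(1.0)
        elif p >= threshold or p <= 1 - threshold:
            # Confident prediction on an unobserved action: pseudo-label it.
            labels.append(1.0 if p >= 0.5 else 0.0)
            mask.append(1.0)
        else:
            # Uncertain and unobserved: leave out of the loss.
            labels.append(0.0)
            mask.append(0.0)
    return labels, mask


def masked_bce(probs, labels, mask, eps=1e-7):
    """Binary cross-entropy averaged over the labeled (unmasked) actions."""
    total = 0.0
    for p, y, m in zip(probs, labels, mask):
        p = min(max(p, eps), 1 - eps)
        total += -m * (y * math.log(p) + (1 - y) * math.log(1 - p))
    n = sum(mask)
    return total / n if n else 0.0


# Usage: five atomic actions; the agent had predicted actions 1 and 2,
# and the user rejected action 1 and accepted action 2.
probs = [0.95, 0.6, 0.2, 0.5, 0.03]
labels, mask = build_full_labels(probs, {1, 2}, {1: 0, 2: 1})
loss = masked_bce(probs, labels, mask)
```

In this sketch, the mask keeps uncertain, unobserved actions out of the loss, so the model is trained only on signals that come either from the user or from its own confident predictions.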

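This paper proposes BanditMatch, a multi-action dialog policy learning method that improves policy learning by exploiting explicit and implicit turn-level user feedback, using a hybrid objective that combines semi-supervised learning and bandit learning.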