Multi-action dialog policy, which generates multiple atomic dialog actions
per turn, has been widely applied in task-oriented dialog systems to provide
expressive and efficient system responses. Existing policy models usually
imitate action combinations from the labeled multi-action dialog examples. Due
to data limitations, they generalize poorly toward unseen dialog flows. While
reinforcement learning-based methods are proposed to incorporate the service
ratings from real users and user simulators as external supervision signals,
they suffer from sparse and less credible dialog-level rewards. To cope with
this problem, we explore improving multi-action dialog policy learning with
explicit and implicit turn-level user feedback received for historical
predictions (i.e., logged user feedback), which is cost-efficient to collect and
faithful to real-world scenarios. The task is challenging since the logged user
feedback provides only partial label feedback limited to the particular
historical dialog actions predicted by the agent. To fully exploit such
feedback information, we propose BanditMatch, which addresses the task from a
feedback-enhanced semi-supervised learning perspective with a hybrid objective
of semi-supervised learning and bandit learning. BanditMatch integrates
pseudo-labeling methods to better explore the action space by constructing
full label feedback. Extensive experiments show that BanditMatch
outperforms state-of-the-art methods by generating more concise and
informative responses. The source code and the appendix of this paper can be
obtained from this https URL.
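
To make the idea of a hybrid objective concrete, the following is a minimal sketch and not the BanditMatch implementation. It assumes a multi-label policy that outputs one logit per atomic dialog action. A labeled example contributes an ordinary cross-entropy term. A logged example contributes a loss only on the actions the historical agent actually predicted (the partial feedback), plus confident pseudo-labels on the remaining actions. All function names, the confidence threshold, and the use of a masked cross-entropy as a simple stand-in for the bandit term are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, t, eps=1e-7):
    """Binary cross-entropy for one action probability p and target t."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

def supervised_loss(logits, labels):
    """Mean cross-entropy over all actions of a fully labeled turn."""
    return sum(bce(sigmoid(z), y) for z, y in zip(logits, labels)) / len(logits)

def feedback_loss(logits, logged, feedback, threshold=0.9):
    """Loss for a logged turn with partial feedback (hypothetical formulation).

    logged[i]   -- True if the historical agent predicted action i
    feedback[i] -- user feedback (1.0 good, 0.0 bad); only read where logged[i]
    Unlogged actions get a pseudo-label only when the model is confident.
    """
    losses = []
    for z, was_logged, fb in zip(logits, logged, feedback):
        p = sigmoid(z)
        if was_logged:
            losses.append(bce(p, fb))  # observed user feedback
        elif p >= threshold or p <= 1.0 - threshold:
            losses.append(bce(p, 1.0 if p >= 0.5 else 0.0))  # confident pseudo-label
        # otherwise: no signal for this action, so it is skipped
    return sum(losses) / len(losses) if losses else 0.0

def hybrid_loss(lab_logits, lab_labels, log_logits, logged, feedback, weight=1.0):
    """Supervised term plus a weighted feedback term."""
    return supervised_loss(lab_logits, lab_labels) + weight * feedback_loss(
        log_logits, logged, feedback
    )

if __name__ == "__main__":
    loss = hybrid_loss(
        [2.0, -2.0], [1.0, 0.0],           # one labeled turn, two actions
        [3.0, -3.0, 0.0],                  # one logged turn, three actions
        [True, False, False], [1.0, 0.0, 0.0],
    )
    print(round(loss, 4))
```

In this toy call, the third action of the logged turn has probability 0.5, so it receives neither feedback nor a pseudo-label and is left out of the average.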

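This paper proposes BanditMatch, a multi-action dialog policy learning method that uses explicit and implicit turn-level user feedback to improve policy learning, combining semi-supervised learning and bandit learning in a hybrid objective.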

Multi-Action Dialog Policy Learning from Logged User Feedback

Multi-action dialog policy (MADP), which generates multiple atomic dialog
actions per turn, has been widely applied in task-oriented dialog systems to
provide expressive and efficient system responses. Existing MADP models usually
imitate action combinations from the labeled multi-action dialog samples. Due
to data limitations, they generalize poorly toward unseen dialog flows. While
interactive learning and reinforcement learning algorithms can be applied to
incorporate external data sources of real users and user simulators, they take
significant manual effort to build and suffer from instability. To address
these issues, we propose Planning Enhanced Dialog Policy (PEDP), a novel
multi-task learning framework that learns single-action dialog dynamics to
enhance multi-action prediction. PEDP employs model-based planning to
conceive what to express before deciding the current response, by
simulating single-action dialogs. Experimental results on the MultiWOZ dataset
demonstrate that our fully supervised learning-based method achieves a solid
task success rate of 90.6%, a 3% improvement over state-of-the-art
methods.
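
The planning idea can be illustrated with a toy sketch, which is not the PEDP model itself. It assumes a hypothetical single-action policy and a hypothetical world model over a dialog state represented as the set of slots the user still needs. A short single-action dialog is simulated from the current state, and the actions visited along the way are collected as the multi-action prediction for the current turn. Every name, the state representation, and the stopping rule are assumptions for illustration only.

```python
def single_action_policy(state):
    """Hypothetical policy: pick one action per simulated step."""
    if not state:
        return "stop"
    return "inform-" + min(state)  # deterministic choice: alphabetically first slot

def world_model(state, action):
    """Hypothetical dynamics: informing a slot removes it from the pending set."""
    slot = action.split("-", 1)[1]
    return state - {slot}

def plan_multi_action(state, horizon=3):
    """Simulate a single-action dialog and return the actions it visited."""
    actions = []
    for _ in range(horizon):
        action = single_action_policy(state)
        if action == "stop":
            break
        actions.append(action)
        state = world_model(state, action)
    return actions

if __name__ == "__main__":
    print(plan_multi_action(frozenset({"price", "area"})))
```

Limiting `horizon` caps how many atomic actions the simulated rollout can contribute to one response.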

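This paper proposes Planning Enhanced Dialog Policy (PEDP), a multi-task learning framework that uses model-based planning to simulate single-action dialogs and thereby enhance multi-action prediction. It reaches a task success rate of 90.6%, a 3% improvement over state-of-the-art methods.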