Learning a goal-oriented dialog policy is generally performed offline with supervised learning algorithms or online with reinforcement learning (RL). Additionally, as companies accumulate massive quantities of dialog transcripts between customers and trained human agents, encoder-decoder methods have gained popularity as agent utterances can be directly treated as supervision without the need for utterance-level annotations. However, one potential drawback of such approaches is that they myopically generate the next agent utterance without regard for dialog-level considerations. To resolve this concern, this paper describes an offline RL method for learning from unannotated corpora that can optimize a goal-oriented policy at both the utterance and dialog level. We introduce a novel reward function and use both on-policy and off-policy policy gradient to learn a policy offline without requiring online user interaction or an explicit state space definition.

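This paper proposes an offline reinforcement learning method that learns from unannotated corpora and optimizes a policy at both the utterance level and the dialog level, addressing the insufficient attention existing methods pay to dialog-level considerations. It uses a novel reward function together with on-policy and off-policy policy gradient to learn a policy without requiring online user interaction or an explicit state space definition.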

End-to-End Offline Goal-Oriented Dialog Policy Learning via Policy Gradient