A task-oriented dialogue (TOD) system is designed to accomplish user-defined tasks through dialogue. TOD systems have progressed towards end-to-end modeling by leveraging pre-trained large language models. However, fine-tuning a pre-trained language model with supervised learning alone leads to exposure bias and the token loss problem, which steer the model away from completing the user's task. To address these issues, we propose a TOD system that uses a unified pre-trained language model, GPT2, as its base model and optimizes it with both supervised learning and reinforcement learning (RL). The issues are mitigated with a non-differentiable reward function, computed as the weighted sum of the success rate and BLEU evaluation metrics. Together, these metrics guide the language model towards completing the user's task while keeping responses coherent and fluent. Our model is obtained by fine-tuning the pre-trained model at the dialogue-session level, where each session comprises user utterances, belief states, system acts, and system responses. Experimental results on MultiWOZ2.1 demonstrate that our model increases the inform rate by 1.60% and the success rate by 3.17% compared to the baseline.
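
As a rough illustration of the reward described above, the following minimal sketch combines a success rate and a BLEU score into one scalar and uses it to scale the negative log-likelihood of a logged response. The function names, the weights `alpha` and `beta`, and the reward-weighted loss are assumptions made for illustration; the abstract does not specify the weights or the exact RL objective.

```python
from typing import Sequence


def dialogue_reward(success: float, bleu: float,
                    alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted sum of success rate and BLEU.

    Both metrics are assumed to be normalized to [0, 1]; the weights
    alpha and beta are hypothetical values, not taken from the paper.
    """
    if not (0.0 <= success <= 1.0 and 0.0 <= bleu <= 1.0):
        raise ValueError("success and bleu must be in [0, 1]")
    return alpha * success + beta * bleu


def reward_weighted_nll(token_log_probs: Sequence[float], reward: float) -> float:
    """Scale the negative log-likelihood of a logged response by its reward.

    The reward is treated as a constant, so it does not need to be
    differentiable. This is one common way to use a scalar reward,
    assumed here for illustration only.
    """
    return -reward * sum(token_log_probs)


# Example: a successful dialogue (success = 1.0) with BLEU = 0.2.
r = dialogue_reward(success=1.0, bleu=0.2)
loss = reward_weighted_nll([-0.5, -1.5], r)
```

Because the reward enters only as a multiplier on the log-likelihood, any metric that can be computed on a finished response can be plugged in, which is why non-differentiable metrics such as success rate and BLEU are usable here.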

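This study addresses the exposure bias and token loss problems that arise when a pre-trained language model is fine-tuned with supervised learning alone in a task-oriented dialogue system, which prevent the system from completing user tasks effectively. Using a unified pre-trained language model, GPT2, optimized with both supervised learning and reinforcement learning, the study proposes a new reward function. Experimental results show that the system improves the success rate by 3.17% on the MultiWOZ2.1 dataset.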

Improving Multi-Domain Task-Oriented Dialogue Systems with Offline Reinforcement Learning