The Effectiveness of Offline Reinforcement Learning in Dialogue Response Generation

July 2023

On the Effectiveness of Offline RL for Dialogue Response Generation

Paloma Sodhi, Felix Wu, Ethan R. Elenberg, Kilian Q. Weinberger, Ryan McDonald

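TL;DR: This study uses offline reinforcement learning (RL) to maximize sequence-level objectives in dialogue response generation. A comprehensive evaluation across multiple datasets, models, and metrics shows that offline RL clearly improves performance over teacher forcing without causing training instability or sacrificing practical training budgets.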

Abstract

A common training technique for language models is teacher forcing (TF). TF attempts to match human language exactly, even though identical meanings can be expressed in different ways. This motivates use of