Improving language model generations according to some user-defined quality or style constraints is challenging. Typical approaches include learning on additional human-written data, filtering ``low-quality'' data using heuristics, and/or using reinforcement learning with human feedback (RLHF). However, filtering can remove valuable training signals, whereas data collection and RLHF constantly require additional human-written or LM exploration data, which can be costly to obtain. A natural question to ask is ``Can we leverage RL to optimize LM utility on existing crowd-sourced and internet data?'' To this end, we present Left-over Lunch RL (LoL-RL), a simple training algorithm that uses offline policy gradients to learn language generation tasks as a 1-step RL game. LoL-RL can finetune LMs to optimize arbitrary classifier-based or human-defined utility functions on any sequence-to-sequence data. Experiments with five different language generation tasks, using models of varying sizes and multiple rewards, show that models trained with LoL-RL can consistently outperform the best supervised learning models. We also release our experimental code at https://github.com/abaheti95/LoL-RL
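To make the idea of an offline policy gradient over a 1-step game concrete, the following is a minimal sketch and not the objective used in LoL-RL, whose exact form the abstract does not give. It assumes that each logged output is treated as a single action, that a scalar utility (reward) is already attached to it, and that a constant baseline (here the batch mean reward) is subtracted to form an advantage. The names `sequence_log_prob` and `offline_pg_loss` and the toy probabilities are hypothetical.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of a whole output: the sum of its token log-probs."""
    return sum(math.log(p) for p in token_probs)


def offline_pg_loss(batch, baseline):
    """Advantage-weighted negative log-likelihood over a fixed, logged batch.

    Each item is (token_probs, reward): the model's probabilities for the
    tokens of one logged output, and the scalar utility of that output.
    Scoring the whole output at once is what makes this a 1-step game.
    The constant baseline is an assumption made for illustration only.
    """
    total = 0.0
    for token_probs, reward in batch:
        advantage = reward - baseline
        total += -advantage * sequence_log_prob(token_probs)
    return total / len(batch)


# Toy offline data: two logged outputs with their (assumed) rewards.
batch = [
    ([0.5, 0.5], 1.0),   # higher-utility output
    ([0.25, 0.5], 0.0),  # lower-utility output
]
mean_reward = sum(reward for _, reward in batch) / len(batch)
loss = offline_pg_loss(batch, baseline=mean_reward)
```

Minimizing this loss raises the likelihood of outputs with above-baseline reward and lowers it for those below, without sampling any new text from the model. In a real setup the token probabilities would come from the LM being finetuned and the loss would be differentiated with respect to its parameters.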

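This paper proposes Left-over Lunch RL (LoL-RL), a simple algorithm that uses offline policy gradients to learn language generation tasks as a 1-step RL game, finetuning language models to optimize arbitrary classifier-based or human-defined utility functions. Experiments across multiple tasks, with models of varying sizes and multiple reward models, show that models trained with LoL-RL can consistently outperform the best supervised learning models.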

Language Model Optimization with Advantage-Based Offline Policy Gradients