Large language models distill broad knowledge from text corpora. However,
they can be inconsistent when it comes to completing user-specified tasks. This
issue can be addressed by finetuning such models via supervised learning on
curated datasets, or via reinforcement learning. In this work, we propose a
novel offline-RL-motivated method, implicit language Q-learning (ILQL),
designed for use on language models, that combines the flexible utility
optimization framework of traditional RL algorithms with supervised learning's
ability to leverage existing data, as well as its simplicity and stability. Our method,
based on dynamic programming, employs value conservatism together with
an implicit dataset support constraint when learning value functions, which are
then used to guide language model generations towards maximizing utility. In
addition to empirically validating ILQL, we present a detailed empirical
analysis of situations where offline RL can be useful in natural language
generation settings, demonstrating how it can be a more effective utility
optimizer than prior approaches for end-to-end dialogue, and how it can
effectively optimize high-variance reward functions based on subjective
judgement, such as whether to label a comment as an example of toxic speech or
not.

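This paper proposes an offline reinforcement learning method, ILQL, that combines the flexible optimization framework of traditional reinforcement learning algorithms with supervised learning's ability to leverage existing data and its simplicity and stability, in order to guide language model generation toward maximizing utility and to effectively optimize high-variance reward functions in natural language generation settings.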