While large-scale unsupervised language models (LMs) learn broad world
knowledge and some reasoning skills, achieving precise control of their
behavior is difficult due to the completely unsupervised nature of their
training. Existing methods for gaining such steerability collect human labels
of the relative quality of model generations and fine-tune the unsupervised LM
to align with these preferences, often with reinforcement learning from human
feedback (RLHF). However, RLHF is a complex and often unstable procedure, first
fitting a reward model that reflects the human preferences, and then
fine-tuning the large unsupervised LM using reinforcement learning to maximize
this estimated reward without drifting too far from the original model. In this
paper, we leverage a mapping between reward functions and optimal policies to
show that this constrained reward maximization problem can be optimized exactly
with a single stage of policy training, essentially solving a classification
problem on the human preference data. The resulting algorithm, which we call
Direct Preference Optimization (DPO), is stable, performant and computationally
lightweight, eliminating the need for fitting a reward model, sampling from the
LM during fine-tuning, or performing significant hyperparameter tuning. Our
experiments show that DPO can fine-tune LMs to align with human preferences as
well as or better than existing methods. Notably, fine-tuning with DPO exceeds
RLHF's ability to control sentiment of generations and improves response
quality in summarization and single-turn dialogue while being substantially
simpler to implement and train.
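
To make the "classification problem on the human preference data" concrete, here is a minimal sketch of a per-example DPO-style loss in plain Python. The abstract itself does not spell out the objective. The form used here, a binary logistic loss on the beta-scaled difference between the policy-versus-reference log-probability ratios of the preferred and dispreferred responses, follows the loss as it is usually stated for DPO. The function names, argument names, and the default `beta=0.1` are illustrative assumptions, not the authors' code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)) for a scalar z."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls how strongly drift from the reference is penalized
) -> float:
    """Per-example DPO-style loss (illustrative sketch).

    Each *_logp argument is the summed log-probability of a full response
    given the prompt, under either the trainable policy or the frozen
    reference model.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it compared with the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Binary classification: the preferred response should receive the
    # larger implicit reward.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference model, the margin is zero and the loss is log 2 for any `beta`. The loss falls as the policy shifts probability toward the preferred response relative to the reference. No separate reward model and no sampling from the LM appear in the computation, which matches the simplification the abstract describes.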

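This paper proposes an algorithm called Direct Preference Optimization (DPO) to address the controllability problem in unsupervised language models, and shows experimentally that DPO matches or outperforms conventional RLHF methods while being more stable and simpler.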

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Recently, research has started focusing on avoiding undesired effects that
come with content moderation, such as censorship and overblocking, when dealing
with hatred online. The core idea is to directly intervene in the discussion
with textual responses that are meant to counter the hate content and prevent
it from further spreading. Accordingly, automation strategies, such as natural
language generation, are beginning to be investigated. Still, they suffer from
a lack of sufficient high-quality data and tend to produce
generic/repetitive responses. Aware of these limitations, we
present a study on how to collect responses to hate effectively, employing
large-scale unsupervised language models such as GPT-2 to generate
silver data, and on the best annotation strategies/neural architectures that can
be used for data filtering before expert validation/post-editing.

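This study aims to collect responses to hate speech effectively, using large-scale unsupervised language models to generate silver data and identifying the best annotation strategies/neural architectures for filtering that data before expert validation/post-editing.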

Generating Counter Narratives against Online Hate Speech: Data and Strategies

Text classification tends to be difficult when data are deficient or when it
is required to adapt to unseen classes. In such challenging scenarios, recent
studies have often used meta-learning to simulate the few-shot task, thus
neglecting implicit common linguistic features across tasks. This paper addresses
such problems using meta-learning and unsupervised language models. Our
approach is based on the insight that generalizing well from a few
examples relies on both a generic model initialization and an effective
strategy for adapting this model to newly arising tasks. We show that our
approach is not only simple but also achieves state-of-the-art performance on
a well-studied sentiment classification dataset. It can thus be further
suggested that pretraining could be a promising solution for few-shot learning
of many other NLP tasks. The code and the dataset to replicate the experiments
are made available at this https URL

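This work uses meta-learning and unsupervised language models to tackle text classification when data are scarce or when the model must adapt to unseen classes. It achieves state-of-the-art performance on a sentiment classification dataset, suggesting that pretraining may be a promising solution for few-shot learning in many other NLP tasks.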