The generation of undesirable and factually incorrect content by large
language models poses a significant challenge and remains a largely unsolved
issue. This paper studies the integration of a contrastive learning objective
for fine-tuning LLMs for implicit knowledge editing and controlled text
generation. Optimizing the training objective entails aligning text
perplexities in a contrastive fashion. To facilitate training the model in a
self-supervised fashion, we leverage an off-the-shelf LLM for training data
generation. We showcase applicability in the domain of detoxification. Herein,
the proposed approach leads to a significant decrease in the generation of
toxic content while preserving general utility for downstream tasks such as
commonsense reasoning and reading comprehension. The proposed approach is
conceptually simple but empirically powerful.
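
To make "aligning text perplexities in a contrastive fashion" concrete, here is a minimal sketch. Only the perplexity definition is standard; the InfoNCE-style loss, the function names, and the `temperature` parameter are assumptions for illustration, not the paper's exact objective. Positives stand for desirable (e.g. non-toxic) texts and negatives for undesirable ones, each given as a list of per-token log-probabilities under the model.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity: exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def _logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def contrastive_perplexity_loss(
    pos_logprobs: Sequence[Sequence[float]],
    neg_logprobs: Sequence[Sequence[float]],
    temperature: float = 1.0,  # hypothetical hyperparameter
) -> float:
    """Assumed InfoNCE-style loss using negative perplexity as the score.

    The loss is small when positives have low perplexity relative to
    negatives, and large in the opposite case.
    """
    pos = [-perplexity(lp) / temperature for lp in pos_logprobs]
    neg = [-perplexity(lp) / temperature for lp in neg_logprobs]
    return _logsumexp(pos + neg) - _logsumexp(pos)
```

In actual fine-tuning the log-probabilities would come from the model being trained and the loss would be differentiated with an autodiff framework; plain floats are used here only to keep the sketch self-contained.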

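This work integrates a contrastive learning objective into the fine-tuning of large language models to achieve implicit knowledge editing and controlled text generation, addressing the generation of undesirable and factually incorrect content. Training is self-supervised, with an off-the-shelf language model generating the training data. The method reduces how often toxic content is generated while preserving the model's utility on general tasks such as commonsense reasoning and reading comprehension. It is conceptually simple and effective in practice.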

Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models

Deep generative models are known to produce undesirable samples such as
harmful content. Traditional mitigation methods include re-training from
scratch, filtering, or editing; however, these are either computationally
expensive or can be circumvented by third parties. In this paper, we take a
different approach and study how to post-edit an already-trained conditional
generative model so that it redacts certain conditionals that will, with high
probability, lead to undesirable content. This is done by distilling the
conditioning network in the models, giving a solution that is effective,
efficient, controllable, and universal for a class of deep generative models.
We conduct experiments on redacting prompts in text-to-image models and
redacting voices in text-to-speech models. Our method is computationally light
and leads to better redaction quality and robustness than baseline methods,
while still retaining high generation quality.
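
One way to picture redaction by distilling the conditioning network is sketched below. It is an illustrative assumption, not the paper's algorithm: a student conditioning network is trained to match the teacher on ordinary conditionals and to match the teacher's output for a harmless reference conditional on redacted ones. The names `teacher`, `student`, `redacted`, and `reference` are hypothetical, and embeddings are plain lists of floats.

```python
from typing import Callable, List, Sequence, Set

Embedding = List[float]
ConditioningNet = Callable[[str], Embedding]


def distillation_targets(
    conditionals: Sequence[str],
    teacher: ConditioningNet,
    redacted: Set[str],
    reference: str,
) -> List[Embedding]:
    """Teacher output per conditional, rerouted to `reference` if redacted."""
    ref_emb = teacher(reference)
    return [ref_emb if c in redacted else teacher(c) for c in conditionals]


def distillation_loss(
    student: ConditioningNet,
    conditionals: Sequence[str],
    targets: Sequence[Embedding],
) -> float:
    """Mean squared error between student outputs and the targets."""
    total, count = 0.0, 0
    for c, target in zip(conditionals, targets):
        pred = student(c)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
        count += len(target)
    return total / count
```

Because only the conditioning network is retrained in this sketch, the rest of the generative model stays untouched, which is consistent with the abstract's claim of a computationally light method.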

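This paper studies how to post-edit an already-trained conditional generative model so that it redacts certain conditionals that would, with high probability, lead to undesirable content. This is achieved by distilling the conditioning network in the model. The method is computationally light while retaining high generation quality, and it is effective, efficient, controllable, and universal for a class of deep generative models. Experiments on text-to-image and text-to-speech models show better redaction quality and stronger robustness than baseline methods.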

Data Redaction from Conditional Generative Models

Language models (LMs) are pretrained to imitate internet text, including
content that would violate human preferences if generated by an LM: falsehoods,
offensive comments, personally identifiable information, low-quality or buggy
code, and more. Here, we explore alternative objectives for pretraining LMs in
a way that also guides them to generate text aligned with human preferences. We
benchmark five objectives for pretraining with human feedback across three
tasks and study how they affect the trade-off between alignment and
capabilities of pretrained LMs. We find a Pareto-optimal and simple approach
among those we explored: conditional training, or learning a distribution over
tokens conditional on their human preference scores given by a reward model.
Conditional training reduces the rate of undesirable content by up to an order
of magnitude, both when generating without a prompt and with an
adversarially-chosen prompt. Moreover, conditional training maintains the
downstream task performance of standard LM pretraining, both before and after
task-specific finetuning. Pretraining with human feedback results in much
better preference satisfaction than standard LM pretraining followed by
finetuning with feedback, i.e., learning and then unlearning undesirable
behavior. Our results suggest that we should move beyond imitation learning
when pretraining LMs and incorporate human preferences from the start of
training.
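
Conditional training can be illustrated with a small data-preparation sketch. It assumes one plausible realization: each text segment is prefixed with a control token chosen by thresholding a reward score, and generation is later conditioned on the "good" token. The token strings, the `threshold`, and the toy keyword-based `toy_reward` are hypothetical stand-ins, not details confirmed by the abstract.

```python
from typing import Callable, Sequence

GOOD_TOKEN = "<|good|>"  # hypothetical control token
BAD_TOKEN = "<|bad|>"    # hypothetical control token


def toy_reward(segment: str) -> float:
    """Toy stand-in for a reward model: penalize blocklisted words."""
    blocklist = {"stupid", "idiot"}
    words = {w.strip(".,!?").lower() for w in segment.split()}
    return -1.0 if words & blocklist else 1.0


def tag_segments(
    segments: Sequence[str],
    reward_fn: Callable[[str], float],
    threshold: float,
) -> str:
    """Prefix each segment with a control token based on its reward."""
    return "".join(
        (GOOD_TOKEN if reward_fn(s) >= threshold else BAD_TOKEN) + s
        for s in segments
    )


def conditioned_prompt(prompt: str) -> str:
    """At inference time, condition generation on the 'good' token."""
    return GOOD_TOKEN + prompt
```

The model is then trained with the ordinary next-token objective on the tagged text, so it still sees undesirable content but learns to associate it with the "bad" token rather than imitating it unconditionally.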

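By incorporating human feedback into pretraining, the text generated by language models becomes more controllable and steerable: content that deviates from human preferences is reduced, while the downstream task performance of standard pretraining is maintained both before and after task-specific finetuning. The authors recommend incorporating human feedback from the start of training rather than pretraining language models by imitation learning alone.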