Language models (LMs) often exhibit undesirable text generation behaviors,
including generating false, toxic, or irrelevant outputs. Reinforcement
learning from human feedback (RLHF) - where human preference judgments on LM
outputs are transformed into a learning signal - has recently shown promise in
addressing these issues. However, such holistic feedback conveys limited
information on long text outputs: it does not indicate which aspects of the
outputs influenced user preference, e.g., which parts contain what type(s) of
errors. In this paper, we use fine-grained human feedback (e.g., which sentence
is false, which sub-sentence is irrelevant) as an explicit training signal. We
introduce Fine-Grained RLHF, a framework that enables training and learning
from reward functions that are fine-grained in two respects: (1) density,
providing a reward after every segment (e.g., a sentence) is generated; and (2)
incorporating multiple reward models associated with different feedback types
(e.g., factual incorrectness, irrelevance, and information incompleteness). We
conduct experiments on detoxification and long-form question answering to
illustrate how learning with such reward functions leads to improved
performance, supported by both automatic and human evaluation. Additionally, we
show that LM behaviors can be customized using different combinations of
fine-grained reward models. We release all data, collected human feedback, and
code at this https URL
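
The two respects described above (one reward per segment, and several reward models combined) can be illustrated with a minimal sketch. It is not the authors' implementation: the function `fine_grained_rewards`, the toy scorers, and the weights are all hypothetical stand-ins, and a real setup would use trained reward models and feed the per-segment rewards into an RL algorithm.

```python
from typing import Callable, Dict, List

# A reward function maps one text segment to a scalar score (assumed interface).
RewardFn = Callable[[str], float]


def fine_grained_rewards(
    segments: List[str],
    reward_fns: Dict[str, RewardFn],
    weights: Dict[str, float],
) -> List[float]:
    """Return one combined reward per segment (dense, multi-aspect)."""
    rewards = []
    for seg in segments:
        # Weighted sum over the feedback types, computed for this segment only.
        total = sum(weights[name] * fn(seg) for name, fn in reward_fns.items())
        rewards.append(total)
    return rewards


# Toy stand-ins for learned reward models (hypothetical, for illustration only).
def toy_factuality(seg: str) -> float:
    return -1.0 if "moon is made of cheese" in seg.lower() else 1.0


def toy_relevance(seg: str) -> float:
    return -1.0 if "by the way" in seg.lower() else 1.0


segments = [
    "The Moon orbits the Earth.",
    "By the way, I like tea.",
    "The moon is made of cheese.",
]
reward_fns = {"factuality": toy_factuality, "relevance": toy_relevance}
weights = {"factuality": 1.0, "relevance": 0.5}

rewards = fine_grained_rewards(segments, reward_fns, weights)
print(rewards)
```

Changing the entries of `weights` is one simple way to picture how different combinations of reward models could steer behavior toward different priorities.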

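This paper introduces the Fine-Grained RLHF framework, which trains language models on fine-grained human feedback for long text outputs that contain some errors or uninformative content. Experiments show that the framework mitigates problems such as false, toxic, or irrelevant outputs during language model generation.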

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

We propose a method to control the attributes of Language Models (LMs) for
the text generation task using Causal Average Treatment Effect (ATE) scores and
counterfactual augmentation. We explore this method in the context of LM
detoxification and propose the Causally Fair Language (CFL) architecture for
detoxifying pre-trained LMs in a plug-and-play manner. Our architecture is
based on a Structural Causal Model (SCM) that is mathematically transparent and
computationally efficient as compared with many existing detoxification
techniques. We also propose several new metrics that aim to better understand
the behaviour of LMs in the context of toxic text generation. Further, we
achieve state-of-the-art performance on toxic degeneration, as measured on
the RealToxicityPrompts (RTP) benchmark. Our experiments show that CFL achieves
this detoxification without much impact on the model perplexity. We also show that
CFL mitigates the unintended bias problem through experiments on the BOLD
dataset.
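
The abstract does not spell out how the ATE scores are computed, so the following is only a generic sketch of an average treatment effect estimated with counterfactual replacement, not the CFL formulation. The helper `token_ate`, the replacement word, and the word-list scorer `toy_score` are hypothetical; a real pipeline would use a trained toxicity classifier.

```python
from statistics import mean
from typing import Callable, List


def token_ate(
    token: str,
    replacement: str,
    sentences: List[str],
    score: Callable[[str], float],
) -> float:
    """Average change in score when `token` is swapped for `replacement`."""
    effects = []
    for sentence in sentences:
        words = sentence.split()
        if token not in words:
            continue  # the token is not present, so there is no treatment here
        # Counterfactual sentence: same context, treated token replaced.
        counterfactual = " ".join(replacement if w == token else w for w in words)
        effects.append(score(sentence) - score(counterfactual))
    return mean(effects) if effects else 0.0


# Toy toxicity scorer (hypothetical): fraction of words found in a small list.
def toy_score(sentence: str) -> float:
    toxic_words = {"stupid", "idiot"}
    words = sentence.split()
    return sum(w in toxic_words for w in words) / len(words)


sentences = ["that idea is stupid", "you are a stupid idiot", "have a nice day"]
ate = token_ate("stupid", "nice", sentences, toy_score)
print(round(ate, 3))
```

A token with a large positive score under this kind of estimate raises the toxicity score across contexts, which is the sort of signal a token-level attribute controller could use.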

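Using Causal Average Treatment Effect (ATE) scores and counterfactual augmentation as a method for controlling the attributes of language models (LMs) in text generation, we propose the Causally Fair Language (CFL) architecture, which detoxifies pre-trained LMs in a plug-and-play manner. Our experiments show that CFL achieves this detoxification without much impact on model perplexity, and experiments on the BOLD dataset show that CFL mitigates the unintended bias problem.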