Reward models (RM) play a critical role in aligning language models through
the process of reinforcement learning from human feedback. RMs are trained to
predict a score reflecting human preference, which requires significant time
and cost for human annotation. Additionally, RMs tend to quickly overfit on
superficial features in the training set, hindering their generalization
performance on unseen distributions. We propose a novel approach using
synthetic natural language critiques generated by large language models to
provide additional feedback, evaluating aspects such as instruction following,
correctness, and style. This offers richer signals and more robust features for
RMs to assess and score on. We demonstrate that high-quality critiques improve
the performance and data efficiency of RMs initialized from different
pretrained models. Conversely, we also show that low-quality critiques
negatively impact performance. Furthermore, incorporating critiques enhances
the interpretability and robustness of RM training.

利用大型语言模型生成的合成自然语言评论提供额外的反馈，评估指令遵循、正确性和风格等方面，提供更丰富的信号和更强大的特征，从而提高奖励模型的性能和数据效率，同时增强了奖励模型训练的可解释性和鲁棒性。

改进奖励模型通过合成批评

Improving Reward Models with Synthetic Critiques

Controlled automated story generation seeks to generate natural language
stories satisfying constraints from natural language critiques or preferences.
Existing methods to control for story preference utilize prompt engineering
which is labor intensive and often inconsistent. They may also use
logit-manipulation methods which require annotated datasets to exist for the
desired attributes. To address these issues, we first train a contrastive
bi-encoder model to align stories with corresponding human critiques, named
CARP, building a general purpose preference model. This is subsequently used as
a reward function to fine-tune a generative language model via reinforcement
learning. However, simply fine-tuning a generative language model with a
contrastive reward model does not always reliably result in a story generation
system capable of generating stories that meet user preferences. To increase
story generation robustness we further fine-tune the contrastive reward model
using a prompt-learning technique. A human participant study is then conducted
comparing generations from our full system, ablations, and two baselines. We
show that the full fine-tuning pipeline results in a story generator preferred
over a LLM 20x as large as well as logit-based methods. This motivates the use
of contrastive learning for general purpose human preference modeling.

使用对抗式生成模型和强化学习算法，本论文提出了一种新型的人工智能故事生成系统，能够根据人类喜好和偏好生成自然语言故事。

通过对比强化学习实现故事讲述的鲁棒性偏好学习

Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning

We fine-tune large language models to write natural language critiques
(natural language critical comments) using behavioral cloning. On a topic-based
summarization task, critiques written by our models help humans find flaws in
summaries that they would have otherwise missed. Our models help find naturally
occurring flaws in both model and human written summaries, and intentional
flaws in summaries written by humans to be deliberately misleading. We study
scaling properties of critiquing with both topic-based summarization and
synthetic tasks. Larger models write more helpful critiques, and on most tasks,
are better at self-critiquing, despite having harder-to-critique outputs.
Larger models can also integrate their own self-critiques as feedback, refining
their own summaries into better ones. Finally, we motivate and introduce a
framework for comparing critiquing ability to generation and discrimination
ability. Our measurements suggest that even large models may still have
relevant knowledge they cannot or do not articulate as critiques. These results
are a proof of concept for using AI-assisted human feedback to scale the
supervision of machine learning systems to tasks that are difficult for humans
to evaluate directly. We release our training datasets, as well as samples from
our critique assistance experiments.

本文介绍了利用大型语言模型进行自然语言批判的方法，帮助人们更有效地检测摘要中的问题，并着重研究了批判能力的缩放特性和与生成能力和辨别能力的比较，为机器学习系统的监督提供了 AI 辅助人类反馈的概念证明。