Learning from human feedback is a prominent technique to align the output of
large language models (LLMs) with human expectations. Reinforcement learning
from human feedback (RLHF) leverages human preference signals in the form of
rankings of response pairs to perform this alignment. However, human
preference on LLM outputs can come in much richer forms including natural
language, which may provide detailed feedback on strengths and weaknesses of a
given response. In this work, we investigate the data efficiency of modeling
human feedback expressed in natural language. Specifically, we fine-tune an
open-source LLM, e.g., Falcon-40B-Instruct, on a relatively small amount (1,000
records or even fewer) of human feedback in natural language in the form of
critiques and revisions of responses. We show that this model can improve the quality
of responses from even some of the strongest LLMs such as ChatGPT, BARD, and
Vicuna, through critique and revision of those responses. For instance, after
one iteration of revision of ChatGPT responses, the revised responses achieve a
56.6% win rate over the original ones, and this win rate can be further
improved to 65.9% after applying the revision for five iterations.
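
The iterative critique-and-revise procedure and the win-rate metric described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `iterative_revision`, `critique_fn`, `revise_fn`, and `win_rate` are hypothetical, the two callables stand in for calls to the fine-tuned model, and the win rate is assumed here to be the fraction of pairwise comparisons in which the revised response is preferred (the abstract does not say how ties are handled).

```python
from typing import Callable, List

# Hypothetical signatures: these stand in for calls to the fine-tuned
# critique-and-revision model; they are not an API from the paper.
CritiqueFn = Callable[[str, str], str]      # (prompt, response) -> critique
ReviseFn = Callable[[str, str, str], str]   # (prompt, response, critique) -> revision


def iterative_revision(
    prompt: str,
    response: str,
    critique_fn: CritiqueFn,
    revise_fn: ReviseFn,
    num_iterations: int = 1,
) -> List[str]:
    """Return [original, revision_1, ..., revision_N] for N iterations."""
    history = [response]
    for _ in range(num_iterations):
        current = history[-1]
        # Critique the latest response, then revise it using that critique.
        critique = critique_fn(prompt, current)
        history.append(revise_fn(prompt, current, critique))
    return history


def win_rate(revised_preferred: List[bool]) -> float:
    """Fraction of comparisons in which the revised response was preferred.

    Assumption: every comparison is a strict preference (no ties).
    """
    if not revised_preferred:
        raise ValueError("at least one comparison is required")
    return sum(revised_preferred) / len(revised_preferred)
```

In such a setup, one iteration corresponds to `num_iterations=1` and the five-iteration setting to `num_iterations=5`, with the final element of the returned list compared against the original response.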

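Learning from human feedback improves the alignment of large language model (LLM) outputs with human expectations. Whereas reinforcement learning uses human feedback signals in the form of rankings of response pairs, this work studies the data efficiency of modeling feedback in natural language, and it raises response quality by iteratively revising the responses of models such as ChatGPT, BARD, and Vicuna.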