Pairwise human judgments are pivotal in guiding large language models (LLMs)
to generate outputs that align with human preferences. They are also often used
in summarization evaluation, complementing existing automatic metrics. Despite
their significance, little research has probed these pairwise human
judgments. The collective impact and respective weights of
factors such as informativeness, coherence, fluency, and factual consistency
remain elusive. The impact of hidden factors on the final judgment is also
unclear. In this paper, we conduct an in-depth examination of a dataset of
pairwise human judgments released by OpenAI. Using the Bradley-Terry-Luce
model, we identify key factors that may influence human
judgments. Our research uncovers the inherent preferences embedded in human
judgments and suggests strategies to boost sample efficiency. Finally, we
provide insights on the construction of balanced datasets for human judgment
evaluations, a crucial step in shaping the behaviors of future LLMs.

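Using the Bradley-Terry-Luce model, this paper conducts an in-depth study of the pairwise human judgment dataset released by OpenAI, examines the key factors that influence human judgments, reveals the inherent preferences in those judgments, and proposes strategies for improving sample efficiency. Finally, it offers insights into constructing balanced datasets for human judgment evaluation, a key step in shaping the behavior of future LLMs.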
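
For readers unfamiliar with the Bradley-Terry-Luce model, the following is a minimal, generic sketch of how it turns pairwise outcomes into per-item strengths, where the probability that item i is preferred over item j is p_i / (p_i + p_j). It is not the authors' implementation: the function names `fit_btl` and `win_probability`, the iterative minorization-maximization update, and the toy factor labels and counts are all assumptions chosen for illustration.

```python
from collections import defaultdict


def fit_btl(comparisons, n_iter=200, tol=1e-9):
    """Estimate BTL strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization update
    p_i <- wins_i / sum_j n_ij / (p_i + p_j), then renormalizes to sum to 1.
    Hypothetical helper for illustration only.
    """
    items = sorted({x for pair in comparisons for x in pair})
    if not items:
        return {}

    wins = defaultdict(int)          # number of comparisons each item won
    pair_counts = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strengths = {item: 1.0 / len(items) for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom if denom > 0 else strengths[i]
        total = sum(updated.values())
        updated = {k: v / total for k, v in updated.items()}
        delta = max(abs(updated[k] - strengths[k]) for k in items)
        strengths = updated
        if delta < tol:
            break
    return strengths


def win_probability(strengths, a, b):
    """BTL probability that item a is preferred over item b."""
    return strengths[a] / (strengths[a] + strengths[b])


# Toy data (made up): each tuple is (preferred, not preferred).
toy = (
    [("informative", "fluent")] * 7 + [("fluent", "informative")] * 3
    + [("informative", "coherent")] * 6 + [("coherent", "informative")] * 4
    + [("coherent", "fluent")] * 6 + [("fluent", "coherent")] * 4
)
toy_strengths = fit_btl(toy)
ranking = sorted(toy_strengths, key=toy_strengths.get, reverse=True)
```

In this toy setup, `ranking` orders the hypothetical labels by estimated strength, and `win_probability(toy_strengths, "informative", "fluent")` gives the modeled chance that the first is preferred over the second. How the paper maps factors onto the model's items may differ from this sketch.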

Analyzing Influential Factors in Human Preference Judgments via GPT-4

Adopting contextually appropriate, audience-tailored linguistic styles is
critical to the success of user-centric language generation systems (e.g.,
chatbots, computer-aided writing, dialog systems). While existing approaches
demonstrate textual style transfer with large volumes of parallel or
non-parallel data, we argue that grounding style on audience-independent
external factors is innately limiting for two reasons. First, it is difficult
to collect large volumes of audience-specific stylistic data. Second, some
stylistic objectives (e.g., persuasiveness, memorability, empathy) are hard to
define without audience feedback.
In this paper, we propose the novel task of style infusion: infusing the
stylistic preferences of audiences into pretrained language generation models.
Since humans are better at pairwise comparisons than at direct scoring (i.e.,
judging whether Sample-A is more persuasive/polite/empathic than Sample-B), we
leverage limited pairwise human judgments to bootstrap a style analysis model
and augment our seed set of judgments. We then infuse the learned textual style
into a GPT-2-based text generator while balancing fluency and style adoption. With
quantitative and qualitative assessments, we show that our infusion approach
can generate compelling stylized examples with generic text prompts. The code
and data are accessible at this https URL.

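This paper proposes the new task of style infusion, which aims to incorporate the stylistic preferences of audiences into pretrained language generation models so that they generate stylized text. Using a limited number of human judgments, the method supplies data for a style analysis model and augments its sample set, while balancing fluency and style adoption. Experimental results show that the infusion approach can generate compelling stylized examples.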