Large language models (LLMs) often generate content that contains factual
errors when responding to fact-seeking prompts on open-ended topics. To
benchmark a model's long-form factuality in open domains, we first use GPT-4 to
generate LongFact, a prompt set comprising thousands of questions spanning 38
topics. We then propose that LLM agents can be used as automated evaluators for
long-form factuality through a method which we call Search-Augmented Factuality
Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into
a set of individual facts and to evaluate the accuracy of each fact using a
multi-step reasoning process comprising sending search queries to Google Search
and determining whether a fact is supported by the search results. Furthermore,
we propose extending F1 score as an aggregated metric for long-form factuality.
To do so, we balance the percentage of supported facts in a response
(precision) with the percentage of provided facts relative to a hyperparameter
representing a user's preferred response length (recall).
Empirically, we demonstrate that LLM agents can achieve superhuman rating
performance: on a set of ~16k individual facts, SAFE agrees with crowdsourced
human annotators 72% of the time, and on a random subset of 100 disagreement
cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times
cheaper than human annotators. We also benchmark thirteen language models on
LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding
that larger language models generally achieve better long-form factuality.
LongFact, SAFE, and all experimental code are available at
this https URL
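
The extended F1 metric described above can be illustrated with a minimal sketch. The abstract gives only the two ingredients (precision over supported facts, and recall relative to a preferred-length hyperparameter), so the exact form below is an assumption: recall is taken as the number of supported facts divided by K and capped at 1, and a response with no supported facts scores 0. The function name `f1_at_k` and its parameters are hypothetical.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Hypothetical helper: F1-style long-form factuality score for one response.

    Assumed definitions (not spelled out in the abstract):
      precision = supported / (supported + not_supported)
      recall    = min(supported / k, 1.0)
    where k stands for the user's preferred number of supported facts.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if supported <= 0:
        # Assumption: a response with no supported facts scores zero.
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    # Harmonic mean of precision and recall, as in the standard F1 score.
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a response with 30 supported and 10 unsupported facts scored at K = 60 has precision 0.75 and recall 0.5, giving a score of 0.6.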

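Large language models often produce factual errors when answering fact-seeking prompts on open-ended topics. To evaluate a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set of thousands of questions spanning 38 topics. We then propose using LLM agents as automated evaluators of long-form factuality through a method called SAFE, which breaks a long-form response into individual facts and evaluates the accuracy of each fact with a multi-step reasoning process: sending search queries to Google Search and determining whether the search results support the fact. In addition, we propose extending the F1 score into an aggregated metric for long-form factuality by balancing the percentage of supported facts in a response (precision) against the percentage of provided facts relative to a hyperparameter representing the user's preferred response length (recall). Empirically, we show that LLM agents can achieve superhuman rating performance: on a set of about 16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. SAFE is also more than 20 times cheaper than human annotators. We further benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2) and find that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at this https URL.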
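
The SAFE procedure summarized above (split a response into facts, issue search queries over several steps, then judge support) can be sketched as a short loop. This is only an illustration under stated assumptions, not the authors' implementation: the four callables stand in for LLM and search calls, and their names, signatures, and the `max_queries` parameter are all hypothetical.

```python
from typing import Callable, Dict, List


def rate_response(
    response: str,
    split_into_facts: Callable[[str], List[str]],   # hypothetical LLM call
    make_query: Callable[[str, List[str]], str],    # hypothetical LLM call
    search: Callable[[str], str],                   # hypothetical search call
    is_supported: Callable[[str, List[str]], bool], # hypothetical LLM call
    max_queries: int = 3,                           # assumed step budget
) -> Dict[str, int]:
    """Count supported and unsupported facts in one long-form response."""
    counts = {"supported": 0, "not_supported": 0}
    for fact in split_into_facts(response):
        evidence: List[str] = []
        # Multi-step loop: each new query may depend on earlier results.
        for _ in range(max_queries):
            query = make_query(fact, evidence)
            evidence.append(search(query))
        key = "supported" if is_supported(fact, evidence) else "not_supported"
        counts[key] += 1
    return counts
```

Passing the model and search calls in as arguments keeps the sketch runnable with simple stand-ins; the resulting counts are the inputs an aggregate metric such as the extended F1 score would need.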

Long-form factuality in large language models

Large language models (LLMs) have shown promise as automated evaluators for
assessing the quality of answers generated by AI systems. However, these
LLM-based evaluators exhibit position bias, or inconsistency, when used to
evaluate candidate answers in pairwise comparisons, favoring either the first
or second answer regardless of content. To address this limitation, we propose
PORTIA, an alignment-based system designed to mimic human comparison strategies
to calibrate position bias in a lightweight yet effective manner. Specifically,
PORTIA splits the answers into multiple segments, aligns similar content across
candidate answers, and then merges them back into a single prompt for
evaluation by LLMs. We conducted extensive experiments with six diverse LLMs to
evaluate 11,520 answer pairs. Our results show that PORTIA markedly enhances
the consistency rates for all the models and comparison forms tested, achieving
an average relative improvement of 47.46%. Remarkably, PORTIA enables less
advanced GPT models to achieve 88% agreement with the state-of-the-art GPT-4
model at just 10% of the cost. Furthermore, it rectifies around 80% of the
position bias instances within the GPT-4 model, elevating its consistency rate
up to 98%. Subsequent human evaluations indicate that the PORTIA-enhanced
GPT-3.5 model can even surpass the standalone GPT-4 in terms of alignment with
human evaluators. These findings highlight PORTIA's ability to correct position
bias, improve LLM consistency, and boost performance while remaining
cost-efficient. This represents a valuable step toward a more reliable and
scalable use of LLMs for automated evaluations across diverse applications.
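
The position bias discussed above can be made concrete with a small consistency check: each pair is judged twice, once in each order, and the pair counts as consistent only if both runs prefer the same underlying answer. This is a minimal sketch, assuming a hypothetical `judge` callable that returns "first", "second", or "tie"; the abstract does not specify the exact measurement protocol.

```python
from typing import Callable, Iterable, Tuple


def consistency_rate(
    pairs: Iterable[Tuple[str, str]],
    judge: Callable[[str, str], str],  # hypothetical: "first", "second", "tie"
) -> float:
    """Fraction of answer pairs whose verdict survives swapping the order."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    # Map a verdict given on the swapped order back to the original order.
    unswap = {"first": "second", "second": "first", "tie": "tie"}
    consistent = 0
    for answer_a, answer_b in pairs:
        forward = judge(answer_a, answer_b)
        backward = unswap[judge(answer_b, answer_a)]
        if forward == backward:
            consistent += 1
    return consistent / len(pairs)
```

A judge that always prefers whichever answer appears first scores 0.0 on this measure, while a judge whose verdict depends only on content scores 1.0.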

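This work proposes PORTIA, a system that calibrates position bias by mimicking human comparison strategies: it aligns similar content across candidate answers and merges them into a single prompt for evaluation by a large language model. Experiments show that PORTIA markedly improves consistency for all models tested, enables less advanced GPT models to reach 88% agreement with GPT-4 at 10% of the cost, and corrects around 80% of the position bias instances in GPT-4. In human evaluations, the PORTIA-enhanced GPT-3.5 model even surpasses standalone GPT-4 in alignment with human evaluators, highlighting PORTIA's ability to correct position bias and improve LLM consistency and performance.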
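
The split, align, and merge steps described above can be sketched as follows. This is an illustration only, not PORTIA's actual algorithm: it assumes sentence-level splitting, uses Jaccard token overlap as a stand-in similarity measure, and searches all cut points by brute force. All function names and the prompt labels are hypothetical.

```python
import re
from itertools import combinations
from typing import List, Sequence, Tuple


def sentences(text: str) -> List[str]:
    """Split text into sentences at whitespace following ., !, or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; an assumed stand-in for a similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def segmentations(sents: Sequence[str], k: int) -> List[List[str]]:
    """All ways to cut a sentence list into k contiguous, non-empty segments."""
    n = len(sents)
    result = []
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        result.append([" ".join(sents[i:j]) for i, j in zip(bounds, bounds[1:])])
    return result


def align(answer_a: str, answer_b: str, k: int = 2) -> Tuple[List[str], List[str]]:
    """Pick the segmentations whose corresponding segments overlap most."""
    best = None
    for seg_a in segmentations(sentences(answer_a), k):
        for seg_b in segmentations(sentences(answer_b), k):
            score = sum(jaccard(x, y) for x, y in zip(seg_a, seg_b))
            if best is None or score > best[0]:
                best = (score, seg_a, seg_b)
    if best is None:
        # Too few sentences to cut k ways: fall back to one segment each.
        return [answer_a], [answer_b]
    return best[1], best[2]


def merge(seg_a: Sequence[str], seg_b: Sequence[str]) -> str:
    """Interleave aligned segments into one prompt body for the evaluator."""
    parts = []
    for i, (a, b) in enumerate(zip(seg_a, seg_b), start=1):
        parts.append(f"[Answer 1, part {i}] {a}\n[Answer 2, part {i}] {b}")
    return "\n".join(parts)
```

Interleaving the aligned segments places comparable content from both answers side by side, which is the intuition behind reducing the evaluator's dependence on answer order; the brute-force search is practical only for short answers and small k.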