Concerns regarding the propensity of Large Language Models (LLMs) to produce
inaccurate outputs, also known as hallucinations, have escalated. Detecting
them is vital for ensuring the reliability of applications relying on
LLM-generated content. Current methods often demand substantial resources: they
rely on very large LLMs, on supervised learning with high-dimensional features,
or on intricate linguistic and semantic analyses that are difficult to
reproduce, and most depend on using the same LLM that hallucinated. This paper
introduces a supervised learning approach that uses two simple classifiers and
only four numerical features, derived from token and vocabulary probabilities
obtained from other LLM evaluators, which need not be the model that generated
the text. The method yields promising results, surpassing state-of-the-art outcomes
in multiple tasks across three different benchmarks. Additionally, we provide a
comprehensive examination of the strengths and weaknesses of our approach,
highlighting the significance of the features utilized and the LLM employed as
an evaluator. We have released our code publicly.
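The idea of feeding a handful of probability-based features to a simple classifier can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not name the four features or the two classifiers, so the feature definitions below (minimum token probability, mean token probability, largest gap to the most likely vocabulary item, smallest vocabulary spread), the pure-Python logistic regression, and all numbers are assumptions chosen for illustration.

```python
import math

def token_features(token_probs, vocab_max_probs, vocab_min_probs):
    """Collapse per-token evaluator probabilities into four numbers.

    token_probs[i]     : probability the evaluator LLM assigns to generated token i
    vocab_max_probs[i] : highest probability over the vocabulary at position i
    vocab_min_probs[i] : lowest probability over the vocabulary at position i
    These four summaries are hypothetical; the abstract does not define them.
    """
    deviations = [vmax - p for p, vmax in zip(token_probs, vocab_max_probs)]
    spreads = [vmax - vmin for vmax, vmin in zip(vocab_max_probs, vocab_min_probs)]
    return [
        min(token_probs),                     # least likely generated token
        sum(token_probs) / len(token_probs),  # average token probability
        max(deviations),                      # largest gap to the top vocabulary item
        min(spreads),                         # flattest vocabulary distribution
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Return True when the example is classified as a hallucination."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5

# Made-up toy data: (token probs, vocab max probs, vocab min probs, label),
# where label 1 means "hallucinated". Real inputs would come from an evaluator LLM.
examples = [
    ([0.90, 0.80, 0.95], [0.90, 0.85, 0.95], [0.0, 0.0, 0.0], 0),
    ([0.70, 0.90, 0.85], [0.75, 0.90, 0.85], [0.0, 0.0, 0.0], 0),
    ([0.90, 0.10, 0.30], [0.90, 0.60, 0.50], [0.0, 0.0, 0.0], 1),
    ([0.20, 0.60, 0.15], [0.70, 0.60, 0.80], [0.0, 0.0, 0.0], 1),
]
X = [token_features(tp, vmax, vmin) for tp, vmax, vmin, _ in examples]
y = [label for _, _, _, label in examples]
w, b = train_logistic(X, y)
print([predict(w, b, x) for x in X])
```

Because the classifier only sees four numbers per generation, the evaluator LLM that supplies the probabilities can be swapped without changing the classifier code, which is the property the abstract emphasizes.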

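Using two simple classifiers and four numerical features obtained from other LLM evaluators, this paper introduces a supervised learning approach that achieves promising results and surpasses the current state of the art on three different benchmarks.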

Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

The zero-shot capability of Large Language Models (LLMs) has enabled highly
flexible, reference-free metrics for various tasks, making LLM evaluators
common tools in NLP. However, the robustness of these LLM evaluators remains
relatively understudied; existing work has mainly pursued optimal performance in
terms of correlating LLM scores with human expert scores. In this paper, we
conduct a series of analyses using the SummEval dataset and confirm that LLMs
are biased evaluators, as they (1) exhibit familiarity bias, a preference for
text with lower perplexity; (2) show skewed and biased distributions of
ratings; and (3) experience anchoring effects for multi-attribute judgments. We
also find that LLMs are inconsistent evaluators, showing low "inter-sample"
agreement and sensitivity to prompt differences that are insignificant to human
understanding of text quality. Furthermore, we share recipes for configuring
LLM evaluators to mitigate these limitations. Experimental results on the RoSE
dataset demonstrate improvements over the state-of-the-art LLM evaluators.
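One way to probe the familiarity bias described above is to compute the perplexity of each evaluated text and correlate it with the score the evaluator assigned. The sketch below is illustrative only: the helper names, the use of Pearson correlation, and the toy log-probabilities and ratings are assumptions, not the paper's analysis or data.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pearson(xs, ys):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy data: per-token log-probabilities of three summaries under the
# evaluator LLM, and the 1-5 rating that evaluator gave to each summary.
summary_logprobs = [
    [-0.1, -0.2, -0.1],
    [-0.5, -0.7, -0.6],
    [-1.2, -1.0, -1.4],
]
evaluator_scores = [5, 4, 2]

perplexities = [perplexity(lp) for lp in summary_logprobs]
r = pearson(perplexities, evaluator_scores)
# A clearly negative r would be consistent with a preference for
# lower-perplexity text; it does not by itself prove bias.
print(perplexities, r)
```

A negative correlation alone is not conclusive, since lower-perplexity text may also be better text; a fuller analysis would control for human quality ratings.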

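Through a series of analyses on the SummEval dataset, this study confirms that large language models are biased and inconsistent evaluators: they (1) prefer text with lower perplexity, (2) show skewed rating distributions, and (3) experience anchoring effects in multi-attribute judgments. The authors also share recipes for configuring LLM evaluators to mitigate these limitations, and experiments on the RoSE dataset demonstrate improvements over state-of-the-art LLM evaluators.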

Large Language Models are Inconsistent and Biased Evaluators

As research in large language models (LLMs) continues to accelerate,
LLM-based evaluation has emerged as a scalable and cost-effective alternative
to human evaluations for comparing the ever-increasing list of models. This
paper investigates the efficacy of these "LLM evaluators", particularly in
using them to assess instruction following, a metric that gauges how closely
generated text adheres to the given instruction. We introduce a challenging
meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM
evaluator to discern instruction-following outputs. The authors manually
curated 419 pairs of outputs, one adhering to the instruction and the other
diverging from it while possibly having deceptive qualities that mislead an
LLM evaluator, e.g., a more engaging tone. Unlike in existing meta-evaluations,
we discover
that different evaluators (i.e., combinations of LLMs and prompts) exhibit
distinct performance on LLMBar and even the highest-scoring ones have
substantial room for improvement. We also present a novel suite of prompting
strategies that further close the gap between LLM and human evaluators. With
LLMBar, we hope to offer more insight into LLM evaluators and foster future
research in developing better instruction-following models.
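The pairwise setup described above can be sketched as a small scoring loop: each item holds an instruction, two candidate outputs, and a label marking which output follows the instruction, and an evaluator is scored by how often it picks the labeled output. Everything here is an assumption for illustration: the `Pair` structure, the `length_biased_evaluator` stand-in (a toy heuristic, not an LLM), and the two example pairs, which are invented and not taken from LLMBar.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Pair:
    instruction: str
    output_1: str
    output_2: str
    label: int  # 1 or 2: which output actually follows the instruction

# An evaluator maps (instruction, output_1, output_2) to its preferred index.
Evaluator = Callable[[str, str, str], int]

def length_biased_evaluator(instruction: str, output_1: str, output_2: str) -> int:
    """Toy stand-in for an LLM evaluator that is swayed by superficial
    qualities: it simply prefers the longer output."""
    return 1 if len(output_1) >= len(output_2) else 2

def accuracy(evaluator: Evaluator, pairs: List[Pair]) -> float:
    """Fraction of pairs where the evaluator picks the labeled output."""
    correct = sum(
        evaluator(p.instruction, p.output_1, p.output_2) == p.label for p in pairs
    )
    return correct / len(pairs)

# Invented examples in the spirit of the benchmark description.
pairs = [
    Pair(
        instruction="Reply with one word: the capital of France.",
        output_1="Paris",
        output_2="What a wonderful question! The capital of France is the "
                 "beautiful city of Paris.",
        label=1,  # the engaging answer ignores the one-word constraint
    ),
    Pair(
        instruction="List three primary colors.",
        output_1="Colors are fun!",
        output_2="Red, yellow, and blue.",
        label=2,
    ),
]

print(accuracy(length_biased_evaluator, pairs))
```

In practice the evaluator callable would wrap an LLM plus a prompt, so the same loop can compare different LLM and prompt combinations on identical pairs.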

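By introducing LLMBar, a challenging meta-evaluation benchmark, this study investigates how effectively large language models (LLMs) evaluate whether generated text follows instructions. It finds that different evaluators perform differently on LLMBar and that even the highest-scoring ones still have room for improvement, and it proposes a novel suite of prompting strategies to narrow the gap between LLM and human evaluators. With LLMBar, the authors hope to offer more insight into LLM evaluators and to foster future research on better instruction-following models.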