Sentence simplification, which rewrites a sentence to be easier to read and
understand, is a promising technique to help people with various reading
difficulties. With the rise of advanced large language models (LLMs),
evaluating their performance in sentence simplification has become imperative.
Recent studies have used both automatic metrics and human evaluations to assess
the simplification abilities of LLMs. However, the suitability of existing
evaluation methodologies for LLMs remains in question. First, it is unclear
whether current automatic metrics are suitable for evaluating LLMs'
simplifications. Second, current human evaluation approaches in sentence
simplification often fall into two extremes: they are either too superficial,
failing to offer a clear understanding of the models' performance, or overly
detailed, making the annotation process complex and prone to inconsistency,
which in turn affects the evaluation's reliability. To address these problems,
this study provides in-depth insights into LLMs' performance while ensuring the
reliability of the evaluation. We design an error-based human annotation
framework to assess GPT-4's simplification capabilities. Results show that
GPT-4 generally generates fewer erroneous simplification outputs than the
current state of the art. However, LLMs have their limitations, as seen in
GPT-4's struggles with lexical paraphrasing. Furthermore, we conduct
meta-evaluations on widely used automatic metrics using our human annotations.
We find that while these metrics are effective at detecting large quality
differences, they are not sensitive enough to assess the overall
high-quality simplifications produced by GPT-4.
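
One way to picture such a meta-evaluation is as a pairwise agreement check: for each pair of simplifications, does the metric prefer the same output the human annotators prefer, and does that agreement hold when the human-judged quality gap is small? The sketch below is an illustration under assumptions, not the study's actual protocol. The function name `pairwise_agreement`, the `min_gap` and `max_gap` parameters, and the toy scores are all hypothetical.

```python
from itertools import combinations


def pairwise_agreement(human, metric, min_gap=0.0, max_gap=float("inf")):
    """Fraction of output pairs on which a metric ranks the two outputs
    in the same order as the human scores.

    Only pairs whose absolute human-score gap lies in (min_gap, max_gap]
    are counted. Returns None if no pair falls in that range.
    """
    if len(human) != len(metric):
        raise ValueError("human and metric must have the same length")
    agree = total = 0
    for i, j in combinations(range(len(human)), 2):
        gap = human[i] - human[j]
        if not (min_gap < abs(gap) <= max_gap):
            continue  # skip human ties and pairs outside the gap range
        total += 1
        # Same sign of both differences means the metric agrees with humans.
        if gap * (metric[i] - metric[j]) > 0:
            agree += 1
    return agree / total if total else None


# Toy, made-up scores (higher = better) for five hypothetical outputs.
human_scores = [1.0, 4.0, 4.5, 5.0, 4.8]
metric_scores = [0.20, 0.70, 0.72, 0.69, 0.71]

large_gap = pairwise_agreement(human_scores, metric_scores, min_gap=2.0)
small_gap = pairwise_agreement(human_scores, metric_scores, max_gap=1.0)
```

On this invented data the metric agrees with the human ordering on every large-gap pair but on only a third of the close pairs, which is the kind of pattern the abstract describes. The numbers carry no empirical weight.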

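By designing an error-based human annotation framework to evaluate GPT-4's sentence simplification capabilities, this study provides deeper insight into the performance of large language models while ensuring the reliability of the evaluation. It finds that GPT-4 generally produces fewer erroneous simplification outputs than the current state-of-the-art model, but remains limited in lexical paraphrasing. In addition, we conduct a meta-evaluation of widely used automatic metrics and find that they lack sufficient sensitivity to assess the overall high quality of GPT-4's simplifications.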

An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human Assessment

In-Context Learning (ICL) is an emergent capability of Large Language Models
(LLMs). Only a few demonstrations enable LLMs to be used as a black box for new
tasks. Previous studies have shown that using LLMs' outputs as labels is
effective in training models to select demonstrations. Such a label is expected
to estimate the utility of a demonstration in ICL; however, it has not been well
understood how different labeling strategies affect results on target tasks.
This paper presents an analysis of different utility functions by focusing on
the LLMs' output probability given the ground-truth output, and the task-specific
reward given the LLMs' prediction. Unlike previous work, we introduce a novel labeling
method, incremental utility, which estimates how much incremental knowledge is
brought into the LLMs by a demonstration. We conduct experiments with
instruction-tuned LLMs on binary/multi-class classification, segmentation, and
translation across Arabic, English, Finnish, Japanese, and Spanish. Our results
show that (1) the probability is effective when the probability values are
distributed across the whole value range (on the classification tasks), and (2)
the downstream metric is more robust when nuanced reward values are provided
with long outputs (on the segmentation and translation tasks). We then show
that the proposed incremental utility further helps ICL by contrasting how the
LLMs perform with and without the demonstrations.
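
The with/without contrast can be sketched as a difference of two utilities: the utility obtained with a demonstration in the prompt minus the utility obtained without it. This is a minimal sketch under assumptions, not the paper's exact formulation. The subtraction form, the names `incremental_utility` and `rank_demonstrations`, and the toy `overlap_utility` scorer that stands in for an actual LLM call are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A utility function scores a query given an optional demonstration.
# In practice it could be the LLM's probability of the ground-truth output
# or a task-specific reward on the LLM's prediction (both assumed here).
UtilityFn = Callable[[str, Optional[str]], float]


def incremental_utility(utility: UtilityFn, query: str, demo: str) -> float:
    """Utility with the demonstration minus utility without it (assumed form)."""
    return utility(query, demo) - utility(query, None)


def rank_demonstrations(
    utility: UtilityFn, query: str, demos: Sequence[str]
) -> List[Tuple[float, str]]:
    """Sort candidate demonstrations by incremental utility, best first."""
    scored = [(incremental_utility(utility, query, d), d) for d in demos]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def overlap_utility(query: str, demo: Optional[str]) -> float:
    """Toy stand-in for an LLM-based utility: a fixed zero-shot baseline
    plus a bonus for query words that also appear in the demonstration."""
    baseline = 0.25
    query_words = set(query.lower().split())
    if demo is None or not query_words:
        return baseline
    demo_words = set(demo.lower().split())
    return baseline + 0.5 * len(query_words & demo_words) / len(query_words)


query = "translate the cat sat"
demos = ["the cat sat on the mat", "stock prices fell sharply"]
ranked = rank_demonstrations(overlap_utility, query, demos)
```

Scores like these could then serve as training labels for a demonstration selector. A demonstration that leaves the zero-shot utility unchanged gets a label of zero, however high its absolute utility is.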

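This paper studies in-context learning in large language models, examines how different labeling strategies affect results on target tasks, and proposes a novel labeling method, incremental utility. Experiments show that the method effectively improves the performance of large language models.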