Generative language models produce highly abstractive outputs by design, in
contrast to extractive responses in search engines. Given this characteristic
of LLMs and the resulting implications for content Licensing & Attribution, we
propose the so-called Extractive-Abstractive axis for benchmarking
generative models and highlight the need for developing corresponding metrics,
datasets and annotation guidelines. We limit our discussion to the text
modality.
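
As an illustration only, one simple way to place a text output on such an axis is to measure how many of its n-grams are copied verbatim from a source document. The sketch below is an assumed proxy, not a metric defined in the abstract; the function names `ngrams` and `extractive_fraction`, the whitespace tokenization, and the default `n = 3` are all hypothetical choices.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def extractive_fraction(source: str, output: str, n: int = 3) -> float:
    """Fraction of the output's distinct n-grams that also occur in the source.

    Values near 1.0 suggest a largely extractive (copied) output;
    values near 0.0 suggest a largely abstractive one at this n-gram size.
    This is an illustrative proxy, not the metric proposed by the authors.
    """
    src = ngrams(source.lower().split(), n)
    out = ngrams(output.lower().split(), n)
    if not out:  # output shorter than n tokens: nothing to compare
        return 0.0
    return len(out & src) / len(out)


source = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps"
paraphrase = "a fast auburn fox leaps across a sleepy hound"

print(extractive_fraction(source, copied))      # verbatim span of the source
print(extractive_fraction(source, paraphrase))  # reworded, no shared trigrams
```

Surface overlap of this kind cannot detect close paraphrase, which is one reason dedicated metrics, datasets and annotation guidelines would be needed.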

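The characteristics of generative language models have implications for content licensing and attribution. We therefore propose the Extractive-Abstractive axis for evaluating generative models and emphasize the need to develop corresponding metrics, datasets, and annotation guidelines. We limit our discussion to the text modality.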

The Extractive-Abstractive Axis: Measuring Content "Borrowing" in Generative Language Models

Tack et al. (2023) organized the shared task hosted by the 18th Workshop on
Innovative Use of NLP for Building Educational Applications on generation of
teacher language in educational dialogues. Following the structure of the
shared task, in this study, we attempt to assess the generative abilities of
large language models in providing informative and helpful insights to
students, thereby simulating the role of a knowledgeable teacher. To this end,
we present an extensive evaluation of several benchmarking generative models,
including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and
fine-tuned DialoGPT. Additionally, to optimize for pedagogical quality, we
fine-tuned the Flan-T5 model using reinforcement learning. Our experimental
findings on the Teacher-Student Chatroom Corpus subset indicate the efficacy of
GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT.
We hypothesize that several dataset characteristics, including sampling,
representativeness, and dialog completeness, pose significant challenges to
fine-tuning, thus contributing to the poor generalizability of the fine-tuned
models. Finally, we note the need for these generative models to be evaluated
with a metric that relies not only on dialog coherence and matched language
modeling distribution but also on the model's ability to showcase pedagogical
skills.

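By evaluating how well several benchmark generative models provide information and help to students in educational dialogues, this study aims to simulate the role of a knowledgeable teacher. It finds that GPT-4 outperforms the other models on the Teacher-Student Chatroom Corpus subset, as measured by BERTScore and DialogRPT. It also notes that dataset characteristics such as sampling, representativeness, and dialog completeness pose significant challenges to the generalizability of fine-tuned models. Finally, it stresses the need to evaluate these generative models with a metric that relies not only on dialog coherence and matched language modeling distribution but also on the model's ability to demonstrate pedagogical skills.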