Recently proposed long-form question answering (QA) systems, supported by
large language models (LLMs), have shown promising capabilities. Yet,
attributing and verifying their generated abstractive answers can be difficult,
and automatically evaluating their accuracy remains an ongoing challenge.
In this work, we introduce a new QA task for answering multi-answer questions
by summarizing multiple diverse sources in a semi-extractive fashion.
Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output
a comprehensive answer, while mixing factual quoted spans -- copied verbatim
from given input sources -- and non-factual free-text connectors that glue
these spans together into a single cohesive passage. This setting bridges the
gap between the outputs of well-grounded but constrained extractive QA systems
and more fluent but harder to attribute fully abstractive answers.
In particular, it enables a new mode for language models that leverages their
advanced language generation capabilities while also producing, by design,
fine-grained in-line attributions that are easy to verify, interpret, and evaluate.
To study this task, we create the first dataset of this kind, QuoteSum, with
human-written semi-extractive answers to natural and generated questions, and
define text-based evaluation metrics. Experimenting with several LLMs in
various settings, we find this task to be surprisingly challenging,
demonstrating the importance of QuoteSum for developing and studying such
consolidation capabilities.
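
To make the semi-extractive format concrete, the sketch below models an answer as a sequence of quoted spans and free-text connectors, and checks that every quoted span is an exact substring of the source it cites. The `Quote` and `Connector` types, the `[n: ...]` rendering, and the toy sources are assumptions made for illustration. They are not the QuoteSum data format or the paper's evaluation metrics.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Quote:
    """A span that must appear verbatim in one input source (hypothetical type)."""
    source_id: int  # zero-based index into the list of sources
    text: str


@dataclass(frozen=True)
class Connector:
    """Free text that glues quoted spans together (hypothetical type)."""
    text: str


Segment = Union[Quote, Connector]


def unsupported_quotes(answer: List[Segment], sources: List[str]) -> List[Quote]:
    """Return quoted spans that are not exact substrings of their cited source."""
    bad = []
    for seg in answer:
        if isinstance(seg, Quote):
            in_range = 0 <= seg.source_id < len(sources)
            if not in_range or seg.text not in sources[seg.source_id]:
                bad.append(seg)
    return bad


def render(answer: List[Segment]) -> str:
    """Join segments into one passage, tagging each quote with its source number."""
    parts = []
    for seg in answer:
        if isinstance(seg, Quote):
            parts.append(f"[{seg.source_id + 1}: {seg.text}]")
        else:
            parts.append(seg.text)
    return " ".join(parts)


# Toy sources and answer, invented for this example only.
sources = [
    "The river flows through three countries before reaching the sea.",
    "Its delta supports a large fishing industry.",
]
answer = [
    Connector("The river"),
    Quote(0, "flows through three countries"),
    Connector("and its"),
    Quote(1, "delta supports a large fishing industry"),
]

print(render(answer))
print(unsupported_quotes(answer, sources))
```

Because the quoted spans are copied verbatim, attribution reduces to a substring test in this sketch. A fully abstractive answer would offer no such direct check.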

SEMQA: Semi-Extractive Multi-Source Question Answering

Tasks involving text generation based on multiple input texts, such as
multi-document summarization, long-form question answering, and contemporary
dialogue applications, test models' ability to properly consolidate partly
overlapping information from multiple texts. However, these tasks
entangle the consolidation phase with the often subjective and ill-defined
content selection requirement, impeding proper assessment of models'
consolidation capabilities. In this paper, we suggest revisiting the sentence
union generation task as an effective, well-defined testbed for assessing text
consolidation capabilities, decoupling the consolidation challenge from
subjective content selection. To support research on this task, we present
a refined annotation methodology and tools for crowdsourcing sentence unions,
create the largest union dataset to date, and provide an analysis of its rich
coverage of various consolidation aspects. We then propose a comprehensive
evaluation protocol for union generation, including both human and automatic
evaluation. Finally, as baselines, we evaluate state-of-the-art language models
on the task, along with a detailed analysis of their capacity to address
multi-text consolidation challenges and their limitations.

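This paper proposes the sentence union generation task as an effective, well-defined testbed for assessing text consolidation capabilities, removing the influence of subjective content selection. For this task, we present a refined annotation methodology and crowdsourcing tools, create the largest union dataset to date, and provide a rich analysis of various consolidation aspects. Finally, we evaluate state-of-the-art language models as baselines and analyze in detail their capacity to address multi-text consolidation challenges, along with their limitations.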