Dialogue summarization is abstractive in nature, which makes it prone to
factual errors. Factual correctness of summaries is the highest priority
before practical application. Many efforts have been made to improve
faithfulness in text summarization. However, there is a lack of systematic
study on dialogue summarization systems. In this work, we first perform a
fine-grained human analysis of the faithfulness of dialogue summaries and
observe that over 35% of generated summaries are inconsistent with the
source dialogues. Furthermore, we present a new model-level
faithfulness evaluation method. It examines generation models with multi-choice
questions created by rule-based transformations. Experimental results show that
our evaluation schema is a strong proxy for the factual correctness of
summarization models. The human-annotated faithfulness samples and the
evaluation toolkit are released to facilitate future research toward faithful
dialogue summarization.
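
The multi-choice idea can be illustrated with a minimal sketch. It assumes one
hypothetical rule (swapping two speaker names in a reference summary) and a
hypothetical `score_fn` standing in for a model's likelihood of a summary given
the dialogue; neither is taken from the released toolkit.

```python
from typing import Callable, List, Sequence, Tuple

# (dialogue, candidate summaries, index of the faithful candidate)
Question = Tuple[str, List[str], int]


def swap_names(summary: str, first: str, second: str) -> str:
    """Rule-based transformation (assumed): exchange two speaker names."""
    placeholder = "\x00"  # temporary marker so the two replacements do not collide
    return (summary.replace(first, placeholder)
                   .replace(second, first)
                   .replace(placeholder, second))


def build_question(dialogue: str, reference: str,
                   speakers: Sequence[str]) -> Question:
    """Pair the reference summary with one unfaithful distractor."""
    distractor = swap_names(reference, speakers[0], speakers[1])
    return dialogue, [reference, distractor], 0


def accuracy(questions: List[Question],
             score_fn: Callable[[str, str], float]) -> float:
    """Fraction of questions where the model scores the faithful option highest."""
    correct = 0
    for dialogue, options, gold in questions:
        scores = [score_fn(dialogue, option) for option in options]
        correct += int(scores.index(max(scores)) == gold)
    return correct / len(questions)
```

In practice `score_fn` would wrap the summarization model under evaluation, and
the options would be shuffled so that position carries no signal.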

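This paper proposes a method for systematically evaluating dialogue summarization. Human analysis shows that over 35% of the summaries generated by existing models are inconsistent with the source dialogue. The evaluation toolkit and annotated samples are available for future research.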

Analysis and Evaluation of Faithfulness in Dialogue Summarization

Analyzing and Evaluating Faithfulness in Dialogue Summarization

Feature attribution methods, also known as input salience methods, assign an
importance score to each feature. They are abundant but may produce surprisingly
different results for the same model on the same input. While differences are
expected if
disparate definitions of importance are assumed, most methods claim to provide
faithful attributions and point at the features most relevant for a model's
prediction. Existing work on faithfulness evaluation is not conclusive and does
not provide a clear answer as to how different methods are to be compared.
Focusing on text classification and the model debugging scenario, our main
contribution is a protocol for faithfulness evaluation that makes use of
partially synthetic data to obtain ground truth for feature importance ranking.
Following the protocol, we conduct an in-depth analysis of four standard salience
method classes on a range of datasets and shortcuts for BERT and LSTM models
and demonstrate that some of the most popular method configurations provide
poor results even for the simplest shortcuts. We recommend following the protocol
for each new task and model combination to find the best method for identifying
shortcuts.
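
A minimal sketch of how partially synthetic data yields ground truth for
ranking evaluation is given here. The shortcut token, the single-token
insertion rule, and the precision-at-k scoring are illustrative assumptions
rather than the exact setup of the paper.

```python
import random
from typing import List, Sequence, Set, Tuple


def inject_shortcut(tokens: List[str], shortcut: str,
                    rng: random.Random) -> Tuple[List[str], Set[int]]:
    """Insert a synthetic shortcut token at a random position.

    Returns the new token list and the set of ground-truth positions.
    A model trained on data where this token determines the label should
    rely on it, so a faithful salience method should rank it first.
    """
    position = rng.randrange(len(tokens) + 1)
    return tokens[:position] + [shortcut] + tokens[position:], {position}


def precision_at_k(salience: Sequence[float], gold: Set[int]) -> float:
    """Share of the top-k salient positions that are shortcut positions (k = |gold|)."""
    k = len(gold)
    ranked = sorted(range(len(salience)), key=lambda i: salience[i], reverse=True)
    return len(set(ranked[:k]) & gold) / k
```

Here `salience` would come from whichever attribution method is being compared;
averaging `precision_at_k` over many injected examples gives one score per
method.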

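This paper proposes a protocol for evaluating how faithfully salience methods rank feature importance, analyzes four standard classes of input salience methods in depth for text classification and model debugging, and recommends applying the protocol to each new task and model combination to find the best method for identifying shortcuts.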

A Faithfulness Evaluation Protocol for Input Salience Methods in Text Classification: Can You Find These Shortcuts?

"Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification

Neural abstractive summarization models are prone to generating content that is
inconsistent with the source document, i.e., unfaithful. Existing automatic
metrics do not capture such mistakes effectively. We tackle the problem of
evaluating the faithfulness of a generated summary given its source document. We
first collect human annotations of faithfulness for outputs from numerous
models on two datasets. We find that current models exhibit a trade-off between
abstractiveness and faithfulness: outputs with less word overlap with the
source document are more likely to be unfaithful. Next, we propose an automatic
question answering (QA) based metric for faithfulness, FEQA, which leverages
recent advances in reading comprehension. Given question-answer pairs generated
from the summary, a QA model extracts answers from the document; non-matched
answers indicate unfaithful information in the summary. Among metrics based on
word overlap, embedding similarity, and learned language understanding models,
our QA-based metric has significantly higher correlation with human
faithfulness scores, especially on highly abstractive summaries.
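
The answer-matching step can be sketched as follows. The sketch assumes that
question-answer pairs have already been generated from the summary, that
matching is measured with token-level F1, and that `answer_fn` is a
hypothetical wrapper around any extractive QA model; it is not the released
FEQA code.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(predicted: str, expected: str) -> float:
    """Token-level F1 between two answer strings (case-insensitive)."""
    pred_tokens = predicted.lower().split()
    gold_tokens = expected.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_faithfulness(qa_pairs: List[Tuple[str, str]], document: str,
                    answer_fn: Callable[[str, str], str]) -> float:
    """Average agreement between summary answers and document answers.

    qa_pairs holds (question, answer taken from the summary). A low score
    means the document does not support what the summary states.
    """
    if not qa_pairs:
        return 0.0
    scores = [token_f1(answer_fn(question, document), answer)
              for question, answer in qa_pairs]
    return sum(scores) / len(scores)
```

A summary whose answers cannot be recovered from the document receives a score
near zero, which is the signal the metric relies on.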

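This study proposes FEQA, an automatic question answering based metric for evaluating faithfulness, and finds that current neural abstractive summarization models exhibit a trade-off between abstractiveness and faithfulness.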