Despite the rising popularity of saliency-based explanations, the research
community remains at an impasse, facing doubts concerning their purpose,
efficacy, and tendency to contradict each other. Seeking to unite the
community's efforts around common goals, several recent works have proposed
evaluation metrics. In this paper, we critically examine two sets of metrics:
the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics,
focusing our inquiry on natural language processing. First, we show that we can
inflate a model's comprehensiveness and sufficiency scores dramatically without
altering its predictions or explanations on in-distribution test inputs. Our
strategy exploits the tendency for extracted explanations and their complements
to be "out-of-support" relative to each other and in-distribution inputs. Next,
we demonstrate that the EVAL-X metrics can be inflated arbitrarily by a simple
method that encodes the label, even though EVAL-X is precisely motivated to
address such exploits. Our results raise doubts about the ability of current
metrics to guide explainability research, underscoring the need for a broader
reassessment of what precisely these metrics are intended to capture.
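
For concreteness, the two ERASER metrics can be sketched as follows. This is a minimal illustration, assuming the commonly used definitions (comprehensiveness is the drop in predicted-class probability when the rationale is removed; sufficiency is the drop when only the rationale is kept). The names `eraser_scores` and `toy_predict_proba` are hypothetical and not taken from any released code.

```python
from typing import Callable, Sequence, Set, Tuple


def eraser_scores(
    predict_proba: Callable[[Sequence[str]], float],
    tokens: Sequence[str],
    rationale_idx: Set[int],
) -> Tuple[float, float]:
    """Return (comprehensiveness, sufficiency) for a single input.

    predict_proba maps a token sequence to the probability of the class
    the model predicted on the full input (an assumption of this sketch).
    """
    rationale = [t for i, t in enumerate(tokens) if i in rationale_idx]
    complement = [t for i, t in enumerate(tokens) if i not in rationale_idx]
    p_full = predict_proba(tokens)
    # Comprehensiveness: how much confidence is lost without the rationale.
    comprehensiveness = p_full - predict_proba(complement)
    # Sufficiency: how much confidence is lost with the rationale alone.
    sufficiency = p_full - predict_proba(rationale)
    return comprehensiveness, sufficiency


def toy_predict_proba(tokens: Sequence[str]) -> float:
    # Hypothetical toy model: confidence grows with the count of "good".
    hits = sum(t == "good" for t in tokens)
    return min(1.0, 0.5 + 0.25 * hits)


tokens = ["a", "good", "and", "good", "film"]
comp, suff = eraser_scores(toy_predict_proba, tokens, {1, 3})
print(comp, suff)
```

Both the rationale alone and its complement are token sequences a model rarely sees in training, which is the "out-of-support" property the abstract says the inflation strategy exploits.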

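Research on saliency-based explanation methods faces doubts about their purpose, efficacy, and tendency to contradict each other. Focusing on natural language processing, this paper critically examines two sets of evaluation metrics and shows that it is doubtful whether current metrics can reliably guide explainability research, underscoring the need for a broader reassessment of what these metrics are intended to capture.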

Goodhart's Law Applies to NLP's Explanation Benchmarks

Long-form question answering (LFQA) enables answering a wide range of
questions, but its flexibility poses enormous challenges for evaluation. We
perform the first targeted study of the evaluation of long-form answers,
covering both human and automatic evaluation practices. We hire domain experts
in seven areas to provide preference judgments over pairs of answers, along
with free-form justifications for their choices. We present a careful analysis
of experts' evaluation, which focuses on new aspects such as the
comprehensiveness of the answer. Next, we examine automatic text generation
metrics, finding that no existing metrics are predictive of human preference
judgments. However, some metrics correlate with fine-grained aspects of answers
(e.g., coherence). We encourage future work to move away from a single "overall
score" of the answer and adopt a multi-faceted evaluation, targeting aspects
such as factuality and completeness. We publicly release all of our annotations
and code to spur future work into LFQA evaluation.
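
One simple way to check whether an automatic metric predicts pairwise preferences is to measure how often it ranks the preferred answer higher. The sketch below is an illustration under assumptions, not the procedure used in the paper: the function name `pairwise_agreement`, the tie handling, and the example scores are all hypothetical.

```python
from typing import List, Tuple


def pairwise_agreement(
    metric_scores: List[Tuple[float, float]],
    human_prefs: List[str],
) -> float:
    """Fraction of answer pairs where the metric agrees with the annotator.

    metric_scores holds (score_a, score_b) per pair; human_prefs holds
    "a" or "b" per pair. Ties in the metric count as half agreement
    (an assumption of this sketch).
    """
    if len(metric_scores) != len(human_prefs):
        raise ValueError("inputs must have the same length")
    total = 0.0
    for (score_a, score_b), pref in zip(metric_scores, human_prefs):
        if score_a == score_b:
            total += 0.5
        elif (score_a > score_b) == (pref == "a"):
            total += 1.0
    return total / len(human_prefs)


# Made-up scores for four answer pairs, for illustration only.
scores = [(0.8, 0.6), (0.3, 0.7), (0.5, 0.9), (0.6, 0.2)]
prefs = ["a", "a", "b", "b"]
agreement = pairwise_agreement(scores, prefs)
print(agreement)
```

An agreement rate near 0.5 means the metric does no better than chance at predicting which answer was preferred.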

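A targeted study of the evaluation of long-form answers that emphasizes multi-faceted evaluation. It finds that automatic text generation metrics do not predict human preferences and recommends that future evaluations target multiple aspects such as accuracy, completeness, and objectivity.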

A Critical Evaluation of Evaluations for Long-form Question Answering

Interpretable machine learning has gained much attention recently. Briefness
and comprehensiveness are necessary in order to provide a large amount of
information concisely when explaining a black-box decision system. However,
existing interpretable machine learning methods fail to consider briefness and
comprehensiveness simultaneously, leading to redundant explanations. We propose
the variational information bottleneck for interpretation, VIBI, a
system-agnostic interpretable method that provides a brief but comprehensive
explanation. VIBI adopts an information-theoretic principle, the information
bottleneck principle, as a criterion for finding such explanations. For each
instance, VIBI selects key features that are maximally compressed about an
input (briefness) and informative about a decision made by a black-box system
on that input (comprehensiveness). We evaluate VIBI on three datasets and
compare it with state-of-the-art interpretable machine learning methods in
terms of both interpretability and fidelity, evaluated by human and
quantitative metrics.
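
The trade-off between briefness and comprehensiveness can be written as an information bottleneck objective. The form below is a sketch under assumed notation that does not appear in the abstract: $\mathbf{x}$ is the input, $\mathbf{y}$ the black-box decision, $\mathbf{t}$ the selected explanation, and $\beta \ge 0$ a trade-off weight.

```latex
\[
  p(\mathbf{t} \mid \mathbf{x})
  \;=\;
  \operatorname*{arg\,max}_{p(\mathbf{t} \mid \mathbf{x})}
  \; I(\mathbf{t}; \mathbf{y}) \;-\; \beta \, I(\mathbf{x}; \mathbf{t})
\]
```

Here $I(\mathbf{t}; \mathbf{y})$ rewards explanations that are informative about the decision (comprehensiveness), while $\beta\, I(\mathbf{x}; \mathbf{t})$ penalizes explanations that retain too much of the input (briefness).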

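This work proposes a system-agnostic interpretation method that uses the information bottleneck principle as a criterion for finding key features that are both brief and comprehensive, and evaluates its interpretability and fidelity on three datasets.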