In order to oversee advanced AI systems, it is important to understand their
underlying decision-making process. When prompted, large language models (LLMs)
can provide natural language explanations or reasoning traces that sound
plausible and receive high ratings from human annotators. However, it is
unclear to what extent these explanations are faithful, i.e., truly capture the
factors responsible for the model's predictions. In this work, we introduce
Correlational Explanatory Faithfulness (CEF), a metric that can be used in
faithfulness tests based on input interventions. Previous metrics used in such
tests take into account only binary changes in the predictions. Our metric
accounts for the total shift in the model's predicted label distribution, more
accurately reflecting the explanations' faithfulness. We then introduce the
Correlational Counterfactual Test (CCT) by instantiating CEF on the
Counterfactual Test (CT) from Atanasova et al. (2023). We evaluate the
faithfulness of free-text explanations generated by few-shot-prompted LLMs from
the Llama2 family on three NLP tasks. We find that our metric measures aspects
of faithfulness which the CT misses.
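
The contrast between a binary test and a correlational one can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation. The abstract does not name the distance or the correlation statistic, so the sketch assumes total variation distance as the shift measure and Pearson correlation between that shift and whether the explanation mentions the intervened text. All function names and the toy numbers are hypothetical.

```python
from math import sqrt


def total_variation_distance(p, q):
    """Half the L1 distance between two label distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def label_flipped(p, q):
    """Binary view: did the top predicted label change?"""
    def argmax(d):
        return max(range(len(d)), key=d.__getitem__)
    return argmax(p) != argmax(q)


def pearson(xs, ys):
    """Plain Pearson correlation; undefined if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def correlational_score(interventions):
    """Correlate prediction impact with explanation mentions.

    Each item is (probs_before, probs_after, mentioned), where `mentioned`
    says whether the explanation refers to the inserted text.
    """
    impacts = [total_variation_distance(b, a) for b, a, _ in interventions]
    mentions = [1.0 if m else 0.0 for _, _, m in interventions]
    return pearson(impacts, mentions)


# Toy data (hypothetical): the third intervention shifts the distribution
# substantially without changing the top label.
interventions = [
    ([0.70, 0.20, 0.10], [0.20, 0.70, 0.10], True),
    ([0.60, 0.30, 0.10], [0.55, 0.35, 0.10], False),
    ([0.90, 0.05, 0.05], [0.55, 0.40, 0.05], True),
    ([0.80, 0.10, 0.10], [0.80, 0.10, 0.10], False),
]
score = correlational_score(interventions)
flips = [label_flipped(b, a) for b, a, _ in interventions]
print(score, flips)
```

In the toy data, a flip-only test sees no impact from the third intervention, while the distance-based impact still registers the shift and contributes to the correlation.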

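We evaluate the faithfulness of free-text explanations generated by few-shot-prompted LLMs from the Llama2 family on three NLP tasks, and find that our metric accounts for aspects of faithfulness that the CT misses.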

The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models

Large language models (LLMs) can explain their own predictions, through
post-hoc or Chain-of-Thought (CoT) explanations. However, the LLM could make up
reasonable-sounding explanations that are unfaithful to its underlying
reasoning. Recent work has designed tests that aim to judge the faithfulness of
either post-hoc or CoT explanations. In this paper we argue that existing
faithfulness tests are not actually measuring faithfulness in terms of the
models' inner workings, but only evaluate their self-consistency on the output
level. The aims of our work are two-fold. i) We aim to clarify the status of
existing faithfulness tests in terms of model explainability, characterising
them as self-consistency tests instead. We underline this assessment by
constructing a Comparative Consistency Bank for self-consistency tests that for
the first time compares existing tests on a common suite of 11 open-source LLMs
and 5 datasets -- including ii) our own proposed self-consistency measure
CC-SHAP. CC-SHAP is a new fine-grained measure (not test) of LLM
self-consistency that compares a model's input contributions to answer
prediction and generated explanation. With CC-SHAP, we aim to take a step
further towards measuring faithfulness with a more interpretable and
fine-grained method. Code available at
https://github.com/Heidelberg-NLP/CC-SHAP
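
The comparison that the abstract describes, between input contributions to the answer and input contributions to the explanation, can be sketched as follows. This is a simplified illustration, not the code in the repository above. It assumes that per-token contribution values have already been computed by some attribution method, and it assumes cosine similarity as the comparison; the abstract specifies neither. The function names and numbers are hypothetical.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length contribution vectors."""
    if len(u) != len(v):
        raise ValueError("contribution vectors must cover the same input tokens")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("a contribution vector is all zeros")
    return dot / (norm_u * norm_v)


def contribution_consistency(answer_contribs, explanation_contribs):
    """Score in [-1, 1]: do the same input tokens drive both outputs?

    Near 1: the answer and the explanation rely on the input in the same way.
    Near -1: tokens that support one output count against the other.
    """
    return cosine_similarity(answer_contribs, explanation_contribs)


# Hypothetical per-token contributions for a four-token input.
answer_contribs = [0.50, -0.20, 0.10, 0.00]
explanation_contribs = [0.40, -0.10, 0.20, 0.05]
print(contribution_consistency(answer_contribs, explanation_contribs))
```

Because the score is continuous, it grades how closely the two contribution patterns agree instead of returning a pass or fail verdict, which is in line with the abstract's framing of CC-SHAP as a measure and not a test.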

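Large language models (LLMs) can explain their own predictions through post-hoc or Chain-of-Thought (CoT) explanations, but a model may give explanations that sound plausible yet are inaccurate. This paper assesses existing faithfulness tests and argues that they measure only the self-consistency of the model's outputs, not faithfulness to its inner workings. The authors propose CC-SHAP, a new self-consistency measure that compares a model's input contributions to its answer prediction and to its generated explanation, as a step towards measuring faithfulness more precisely.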

On Measuring Faithfulness of Natural Language Explanations

While enjoying the great achievements brought by deep learning (DL), people
are also worried about the decisions made by DL models, since the high degree of
non-linearity of DL models makes these decisions extremely difficult to
understand. Consequently, attacks such as adversarial attacks are easy to carry
out but difficult to detect and explain, which has led to a boom in
research on local explanation methods for explaining model decisions. In this
paper, we evaluate the faithfulness of explanation methods and find that
traditional tests on faithfulness encounter the random dominance problem, i.e.,
random selection performs the best, especially on complex data. To
address this problem, we propose three trend-based faithfulness tests and
empirically demonstrate that the new trend tests can better assess faithfulness
than traditional tests on image, natural language and security tasks. We
implement the assessment system and evaluate ten popular explanation methods.
Benefiting from the trend tests, we successfully assess the explanation methods
on complex data for the first time, bringing unprecedented discoveries and
inspiring future research. Downstream tasks also greatly benefit from the
tests. For example, model debugging equipped with faithful explanation methods
performs much better for detecting and correcting accuracy and security
problems.

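By evaluating traditional faithfulness tests, we find that they suffer from the random dominance problem on complex data. To address this problem, we propose three trend-based faithfulness tests and demonstrate empirically that the new trend tests assess faithfulness better on image, natural language, and security tasks. We implement the assessment system and evaluate ten popular explanation methods, obtaining unprecedented findings that inspire future research. The tests also greatly benefit downstream tasks: for example, model debugging equipped with faithful explanation methods performs better at detecting and correcting accuracy and security problems.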