We study how well large language models (LLMs) explain their generations with
rationales -- a set of tokens extracted from the input texts that reflect the
decision process of LLMs. We examine LLM rationales extracted with two methods:
1) attribution-based methods that use attention or gradients to locate
important tokens, and 2) prompting-based methods that guide LLMs to extract
rationales using prompts. Through extensive experiments, we show that
prompting-based rationales align better with human-annotated rationales than
attribution-based rationales, and demonstrate reasonable alignment with humans
even when model performance is poor. We additionally find that the faithfulness
limitations of prompting-based methods, which are identified in previous work,
may be linked to their collapsed predictions. Fine-tuning these models on
the corresponding datasets improves the faithfulness of both prompting and
attribution methods. Our study sheds light on more rigorous and fair
evaluations of LLM rationales, especially for prompting-based ones.
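
As a rough illustration of what "alignment with human-annotated rationales" can mean in practice, the sketch below compares a model rationale with a human one at the token level. The abstract does not say which metric the authors use, so token-level precision, recall, and F1 are an assumption here, as are the helper names `top_k_rationale` and `rationale_overlap` and the toy scores.

```python
def top_k_rationale(scores, k):
    """Turn per-token importance scores into a rationale.

    Returns the indices of the k highest-scoring tokens. The scores could
    come from attention or gradients; here they are just a list of floats.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def rationale_overlap(predicted, human):
    """Token-level precision, recall, and F1 between two rationales.

    Both rationales are given as collections of token indices.
    """
    predicted, human = set(predicted), set(human)
    if not predicted or not human:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    true_positives = len(predicted & human)
    precision = true_positives / len(predicted)
    recall = true_positives / len(human)
    if true_positives == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (hypothetical scores and annotation, not data from the paper).
tokens = ["the", "movie", "was", "truly", "awful"]
attribution_scores = [0.05, 0.20, 0.05, 0.30, 0.40]
model_rationale = top_k_rationale(attribution_scores, k=2)  # indices 3 and 4
human_rationale = {1, 3, 4}  # "movie", "truly", "awful"
overlap = rationale_overlap(model_rationale, human_rationale)
```

A prompting-based rationale would skip `top_k_rationale` and instead map the tokens the model returns back to input indices before calling `rationale_overlap`.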

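We study how large language models (LLMs) explain their generations through rationales, a set of tokens extracted from the input text that reflects the LLM's decision process. We extract LLM rationales with two methods: 1) attribution-based methods, which use attention or gradients to locate important tokens, and 2) prompting-based methods, which use prompts to guide the LLM to extract rationales. Through extensive experiments, we show that prompting-based rationales align better with human-annotated rationales and align reasonably well with humans even when model performance is poor. In addition, we find that the faithfulness limitations of prompting-based methods may be linked to their collapsed predictions. Fine-tuning these models on the corresponding datasets improves the faithfulness of both prompting and attribution methods. Our study offers insight into more rigorous and fair evaluation of LLM rationales, especially prompting-based ones.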

Evaluating the Human Alignment and Model Faithfulness of LLM Rationales

Evaluating Human Alignment and Model Faithfulness of LLM Rationale

Neural network visualization techniques mark image locations by their
relevancy to the network's classification. Existing methods are effective in
highlighting the regions that affect the resulting classification the most.
However, as we show, these methods are limited in their ability to identify the
support for alternative classifications, an effect we name the saliency
bias hypothesis. In this work, we integrate two lines of research:
gradient-based methods and attribution-based methods, and develop an algorithm
that provides per-class explainability. The algorithm back-projects the
per-pixel local influence, in a manner that is guided by the local attributions,
while correcting for salient features that would otherwise bias the
explanation. In an extensive battery of experiments, we demonstrate the ability
of our method to produce class-specific visualizations, and not just for the
predicted label. Remarkably, the method obtains state-of-the-art results on benchmarks
that are commonly applied to gradient-based methods as well as in those that
are employed mostly for evaluating attribution methods. Using a new
unsupervised procedure, our method is also successful in demonstrating that
self-supervised methods learn semantic information.
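
To make the idea of per-class explanation concrete, the sketch below computes a separate relevance map for each class of a tiny linear classifier using plain gradient-times-input. This is a deliberately simpler, swapped-in technique, not the algorithm described in the abstract, and the names `class_logits` and `grad_times_input` and the toy weights are assumptions made for illustration.

```python
def class_logits(weights, x):
    """Logits of a bias-free linear classifier: one dot product per class."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]


def grad_times_input(weights, x, target_class):
    """Gradient-times-input relevance for one chosen class.

    For a linear model the gradient of logit c with respect to input i is
    weights[c][i], so the relevance of input i is weights[c][i] * x[i].
    """
    return [w_i * x_i for w_i, x_i in zip(weights[target_class], x)]


# Toy model: 2 classes, 3 input "pixels" (hypothetical values).
weights = [
    [1.0, 0.0, -1.0],  # class 0
    [0.0, 2.0, 0.5],   # class 1
]
x = [2.0, 1.0, 1.0]

scores = class_logits(weights, x)
predicted = scores.index(max(scores))

# One map per class, including the class that was NOT predicted.
maps = {c: grad_times_input(weights, x, c) for c in range(len(weights))}
```

Here the map for class 0 highlights the first input while the map for class 1 highlights the second, so asking only about the predicted class would hide the evidence for the alternative.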

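This paper proposes an algorithm that combines gradient-based and attribution-based methods to provide per-class explainability. The algorithm back-projects the per-pixel local influence while correcting for salient features, and it achieves strong results across benchmarks, including those used to evaluate gradient-based methods and those used mostly to evaluate attribution methods. The paper also shows that self-supervised methods learn semantic information.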