Large Language Models (LLMs) are deployed as powerful tools for several
natural language processing (NLP) applications. Recent works show that modern
LLMs can generate self-explanations (SEs), which elicit their intermediate
reasoning steps for explaining their behavior. Self-explanations have seen
widespread adoption owing to their conversational and plausible nature.
However, there is little to no understanding of their faithfulness. In this
work, we discuss the dichotomy between faithfulness and plausibility in SEs
generated by LLMs. We argue that while LLMs are adept at generating plausible
explanations -- seemingly logical and coherent to human users -- these
explanations do not necessarily align with the reasoning processes of the LLMs,
raising concerns about their faithfulness. We highlight that the current trend
towards increasing the plausibility of explanations, primarily driven by the
demand for user-friendly interfaces, may come at the cost of diminishing their
faithfulness. We assert that the faithfulness of explanations is critical in
LLMs employed for high-stakes decision-making. Moreover, we urge the community
to identify the faithfulness requirements of real-world applications and ensure
explanations meet those needs. Finally, we propose some directions for future
work, emphasizing the need for novel methodologies and frameworks that can
enhance the faithfulness of self-explanations without compromising their
plausibility, essential for the transparent deployment of LLMs in diverse
high-stakes domains.

大型语言模型的自解释性及其在高风险决策中的忠诚度与可信度之间的矛盾。

忠实性与可信度：大型语言模型解释的（不）可靠性

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations  from Large Language Models

Instruction-tuned large language models (LLMs) excel at many tasks, and will
even provide explanations for their behavior. Since these models are directly
accessible to the public, there is a risk that convincing and wrong
explanations can lead to unsupported confidence in LLMs. Therefore,
interpretability-faithfulness of self-explanations is an important
consideration for AI Safety. Assessing the interpretability-faithfulness of
these explanations, termed self-explanations, is challenging as the models are
too complex for humans to annotate what is a correct explanation. To address
this, we propose employing self-consistency checks as a measure of
faithfulness. For example, if an LLM says a set of words is important for
making a prediction, then it should not be able to make the same prediction
without these words. While self-consistency checks are a common approach to
faithfulness, they have not previously been applied to LLM's self-explanations.
We apply self-consistency checks to three types of self-explanations:
counterfactuals, importance measures, and redactions. Our work demonstrate that
faithfulness is both task and model dependent, e.g., for sentiment
classification, counterfactual explanations are more faithful for Llama2,
importance measures for Mistral, and redaction for Falcon 40B. Finally, our
findings are robust to prompt-variations.

利用自洽性检查作为一种忠实度测量，将其应用于大型语言模型自我解释的三种类型，即反事实解释、重要性度量和删除。通过不同任务和模型，发现忠实度是任务和模型相关的，例如对于情感分类，Llama2 的反事实解释、Mistral 的重要性度量和 Falcon 40B 的删除是更加忠实的。最后，我们的发现在提示变体方面是稳健的。

大型语言模型能自我解释吗？

Can Large Language Models Explain Themselves?

Large language models (LLMs) such as ChatGPT have demonstrated superior
performance on a variety of natural language processing (NLP) tasks including
sentiment analysis, mathematical reasoning and summarization. Furthermore,
since these models are instruction-tuned on human conversations to produce
"helpful" responses, they can and often will produce explanations along with
the response, which we call self-explanations. For example, when analyzing the
sentiment of a movie review, the model may output not only the positivity of
the sentiment, but also an explanation (e.g., by listing the sentiment-laden
words such as "fantastic" and "memorable" in the review). How good are these
automatically generated self-explanations? In this paper, we investigate this
question on the task of sentiment analysis and for feature attribution
explanation, one of the most commonly studied settings in the interpretability
literature (for pre-ChatGPT models). Specifically, we study different ways to
elicit the self-explanations, evaluate their faithfulness on a set of
evaluation metrics, and compare them to traditional explanation methods such as
occlusion or LIME saliency maps. Through an extensive set of experiments, we
find that ChatGPT's self-explanations perform on par with traditional ones, but
are quite different from them according to various agreement metrics, meanwhile
being much cheaper to produce (as they are generated along with the
prediction). In addition, we identified several interesting characteristics of
them, which prompt us to rethink many current model interpretability practices
in the era of ChatGPT(-like) LLMs.

ChatGPT 的自解释性能与传统方法相媲美，在成本较低的情况下，且具有许多有趣的特性，促使我们重新思考当前在 ChatGPT（类似的 LLM）时代的模型可解释性实践。