Instruction-tuned large language models (LLMs) excel at many tasks, and will
even provide explanations for their behavior. Since these models are directly
accessible to the public, there is a risk that convincing and wrong
explanations can lead to unsupported confidence in LLMs. Therefore,
interpretability-faithfulness of self-explanations is an important
consideration for AI Safety. Assessing the interpretability-faithfulness of
these explanations, termed self-explanations, is challenging as the models are
too complex for humans to annotate what is a correct explanation. To address
this, we propose employing self-consistency checks as a measure of
faithfulness. For example, if an LLM says a set of words is important for
making a prediction, then it should not be able to make the same prediction
without these words. While self-consistency checks are a common approach to
faithfulness, they have not previously been applied to LLM's self-explanations.
We apply self-consistency checks to three types of self-explanations:
counterfactuals, importance measures, and redactions. Our work demonstrate that
faithfulness is both task and model dependent, e.g., for sentiment
classification, counterfactual explanations are more faithful for Llama2,
importance measures for Mistral, and redaction for Falcon 40B. Finally, our
findings are robust to prompt-variations.

利用自洽性检查作为一种忠实度测量，将其应用于大型语言模型自我解释的三种类型，即反事实解释、重要性度量和删除。通过不同任务和模型，发现忠实度是任务和模型相关的，例如对于情感分类，Llama2 的反事实解释、Mistral 的重要性度量和 Falcon 40B 的删除是更加忠实的。最后，我们的发现在提示变体方面是稳健的。