We study how well large language models (LLMs) explain their generations with rationales: sets of tokens extracted from the input text that reflect the model's decision process. We examine rationales extracted with two kinds of methods: 1) attribution-based methods, which use attention or gradients to locate important tokens, and 2) prompting-based methods, which guide LLMs to extract rationales through prompts. Through extensive experiments, we show that prompting-based rationales align better with human-annotated rationales than attribution-based rationales do, and that they align reasonably well with humans even when model performance is poor. We additionally find that the faithfulness limitations of prompting-based methods identified in previous work may be linked to the models' collapsed predictions. After these models are fine-tuned on the corresponding datasets, both prompting-based and attribution-based methods demonstrate improved faithfulness. Our study sheds light on more rigorous and fair evaluations of LLM rationales, especially prompting-based ones.
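
To make "alignment with human-annotated rationales" concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes a rationale is a set of token indices, that an attribution-based rationale is formed by keeping the top-k tokens by score, and that alignment is measured with token-level F1. The abstract specifies none of these choices; the function names, the example sentence, and the scores are all hypothetical.

```python
def top_k_rationale(scores, k):
    """Return the indices of the k highest-scoring tokens (assumed selection rule)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def token_f1(predicted, human):
    """Token-level F1 between two rationales given as sets of token indices."""
    overlap = len(predicted & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)


# Made-up example: attribution scores for a five-token input.
tokens = ["the", "movie", "was", "painfully", "dull"]
attribution_scores = [0.05, 0.20, 0.05, 0.40, 0.30]
human_rationale = {3, 4}  # annotator marked "painfully" and "dull"

predicted = top_k_rationale(attribution_scores, k=2)
print(sorted(tokens[i] for i in predicted), token_f1(predicted, human_rationale))
```

A prompting-based rationale could be scored with the same `token_f1` function once the tokens the model returns are mapped back to input positions.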

Evaluating Human Alignment and Model Faithfulness of LLM Rationale