The self-rationalising capabilities of large language models (LLMs) have been
explored in restricted settings, using task-specific data sets. However,
current LLMs do not (only) rely on specifically annotated data; nonetheless,
they frequently explain their outputs. The properties of the generated
explanations are influenced by the pre-training corpus and by the target data
used for instruction fine-tuning. As the pre-training corpus includes a large
amount of human-written explanations "in the wild", we hypothesise that LLMs
adopt common properties of human explanations. By analysing the outputs for a
multi-domain instruction fine-tuning data set, we find that generated
explanations show selectivity and contain illustrative elements, but are less
frequently subjective or misleading. We discuss the reasons for and consequences
of the properties' presence or absence. In particular, we outline positive and
negative implications depending on the goals and user groups of the
self-rationalising system.

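The self-rationalising capabilities of large language models have so far been
explored in restricted settings. Current LLMs do not rely only on specifically
annotated data, yet they frequently explain their outputs, and the generated
explanations exhibit common properties of human explanations. Analysing the
outputs for a multi-domain training data set, we find that the generated
explanations show selectivity and contain illustrative elements, but are less
often subjective or misleading. We discuss the reasons for and consequences of
these properties' presence or absence and, in particular, outline positive and
negative implications depending on the goals and user groups of the
self-rationalising system.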

Properties and Challenges of LLM-Generated Explanations

The self-rationalising capabilities of LLMs are appealing because the
generated explanations can give insights into the plausibility of the
predictions. However, how faithful the explanations are to the predictions is
questionable, raising the need to explore the patterns behind them further. To
this end, we propose a hypothesis-driven statistical framework. We use a
Bayesian network to implement a hypothesis about how a task (in our example,
natural language inference) is solved, and its internal states are translated
into natural language with templates. Those explanations are then compared to
LLM-generated free-text explanations using automatic and human evaluations.
This allows us to judge how similar the LLM's and the Bayesian network's
decision processes are. We demonstrate the usage of our framework with an
example hypothesis and two realisations in Bayesian networks. The resulting
models do not exhibit a strong similarity to GPT-3.5. We discuss the
implications of this as well as the framework's potential to approximate LLM
decisions better in future work.

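We propose a hypothesis-driven statistical framework: a Bayesian network
implements a hypothesis about how a task is solved, and its internal states are
translated into natural language with templates. These explanations are then
compared with LLM-generated free-text explanations to judge how similar the
decision processes of the LLM and the Bayesian network are. The results show
that the Bayesian network models are not strongly similar to GPT-3.5; future
work could use the framework to approximate LLM decisions more closely.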
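
As a minimal sketch of the idea described in the abstract above, the following Python snippet builds a toy Bayesian network for natural language inference and verbalises its state with a template. The variables (`high_overlap`, `negation_mismatch`), the network structure, all probabilities, and the template wording are hypothetical choices made for illustration; they are not taken from the paper.

```python
# Toy network (assumed structure): Label -> HighOverlap, Label -> NegationMismatch.
# Every variable name and probability below is an illustrative assumption.

LABELS = ("entailment", "contradiction", "neutral")

# Uniform prior over the three NLI labels.
PRIOR = {label: 1.0 / len(LABELS) for label in LABELS}

# P(feature is True | label), invented for this sketch.
P_OVERLAP = {"entailment": 0.8, "contradiction": 0.6, "neutral": 0.3}
P_NEGATION = {"entailment": 0.05, "contradiction": 0.7, "neutral": 0.1}


def posterior(high_overlap: bool, negation_mismatch: bool) -> dict:
    """Exact inference by enumerating the label variable."""
    scores = {}
    for label in LABELS:
        p_o = P_OVERLAP[label] if high_overlap else 1.0 - P_OVERLAP[label]
        p_n = P_NEGATION[label] if negation_mismatch else 1.0 - P_NEGATION[label]
        scores[label] = PRIOR[label] * p_o * p_n
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


def explain(high_overlap: bool, negation_mismatch: bool) -> tuple:
    """Return the most probable label and a templated explanation."""
    post = posterior(high_overlap, negation_mismatch)
    label = max(post, key=post.get)
    overlap_text = "shares many words with" if high_overlap else "shares few words with"
    negation_text = (
        "but only one of them is negated"
        if negation_mismatch
        else "and their negation matches"
    )
    text = (
        f"The hypothesis {overlap_text} the premise, {negation_text}, "
        f"so the most likely label is {label} (p = {post[label]:.2f})."
    )
    return label, text


if __name__ == "__main__":
    predicted_label, explanation = explain(True, True)
    print(predicted_label)
    print(explanation)
```

Calling `explain` on the observed features of a premise-hypothesis pair yields a label and a templated sentence, which could then be compared with an LLM's free-text explanation for the same pair using automatic or human evaluation.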