We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique that uses subset scanning to detect anomalous patterns in the activations of pre-trained LLMs. Importantly, our method requires no a priori knowledge of the type of pattern. Instead, at test time it relies only on a reference dataset that is free of anomalies. Further, our approach identifies the pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods that handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings that BERT has limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Notably, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
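
To make the idea of scanning activations against an anomaly-free reference set concrete, the following is a minimal sketch of one common formulation of subset scanning: empirical p-values per node followed by a Berk-Jones scan statistic. It is an illustration under stated assumptions, not the two scanning methods introduced in this work. The function names `empirical_pvalues` and `scan_nodes`, the two-sided p-value, and the `alpha_max` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def empirical_pvalues(reference, sample):
    """Two-sided empirical p-value for each node of one test sample.

    reference: (n_ref, n_nodes) activations from anomaly-free inputs.
    sample:    (n_nodes,) activations of the sentence being audited.
    Assumption: two-sided p-values are used so that deviations in
    either direction count as extreme.
    """
    n_ref = reference.shape[0]
    right = (np.sum(reference >= sample, axis=0) + 1) / (n_ref + 1)
    left = (np.sum(reference <= sample, axis=0) + 1) / (n_ref + 1)
    return np.minimum(1.0, 2.0 * np.minimum(left, right))


def scan_nodes(pvalues, alpha_max=0.5):
    """Scan significance thresholds for the highest-scoring node subset.

    For a single sample and a fixed threshold alpha, the Berk-Jones
    score N * KL(N_alpha / N, alpha) is maximized by the subset of
    nodes with p <= alpha, where it reduces to N_alpha * log(1 / alpha).
    Returns (score, node_indices, alpha).
    """
    best = (0.0, np.array([], dtype=int), None)
    for alpha in np.unique(pvalues[pvalues <= alpha_max]):
        nodes = np.flatnonzero(pvalues <= alpha)
        score = nodes.size * np.log(1.0 / alpha)
        if score > best[0]:
            best = (score, nodes, float(alpha))
    return best


if __name__ == "__main__":
    # Toy reference set: 101 clean samples, 5 nodes, values in [-1, 1].
    reference = np.tile(np.linspace(-1.0, 1.0, 101)[:, None], (1, 5))
    # Node 0 is far above and node 3 far below the reference range.
    sample = np.array([5.0, 0.0, 0.0, -5.0, 0.0])
    score, nodes, alpha = scan_nodes(empirical_pvalues(reference, sample))
    print(score, nodes, alpha)
```

In this sketch the returned score is only meaningful relative to the scores of held-out clean samples, and the returned node indices play the role of the nodes flagged as encoding the anomalous pattern.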

Weakly Supervised Detection of Hallucinations in LLM Activations