We propose an auditing method to identify whether a large language model
(LLM) encodes patterns such as hallucinations in its internal states, which may
propagate to downstream tasks. We introduce a weakly supervised auditing
technique using a subset scanning approach to detect anomalous patterns in LLM
activations from pre-trained models. Importantly, our method does not need
a priori knowledge of the type of patterns. Instead, it relies on a reference
dataset devoid of anomalies during testing. Further, our approach enables the
identification of pivotal nodes responsible for encoding these patterns, which
may offer crucial insights for fine-tuning specific sub-networks for bias
mitigation. We introduce two new scanning methods to handle LLM activations for
anomalous sentences that may deviate from the expected distribution in either
direction. Our results confirm prior findings of BERT's limited internal
capacity for encoding hallucinations, while OPT appears capable of encoding
hallucination information internally. Importantly, our scanning approach,
without prior exposure to false statements, performs comparably to a fully
supervised out-of-distribution classifier.
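
To make the scanning idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a common form of subset scanning: each node's activation is converted to a two-sided empirical p-value against the anomaly-free reference set, and a Berk-Jones scan statistic is maximized over significance thresholds. The function names `empirical_pvalues` and `scan_nodes`, the `alpha_max` parameter, and the synthetic Gaussian "activations" are all hypothetical stand-ins, and the sketch does not reproduce the two new scanning methods the abstract mentions.

```python
import numpy as np


def empirical_pvalues(reference, test):
    """Two-sided empirical p-value for each node of one test input.

    reference: (n_ref, n_nodes) activations from anomaly-free sentences.
    test:      (n_nodes,) activations of the sentence being audited.
    """
    n_ref = reference.shape[0]
    # Share of reference activations at least as large / as small as the test value.
    right = (np.sum(reference >= test, axis=0) + 1) / (n_ref + 1)
    left = (np.sum(reference <= test, axis=0) + 1) / (n_ref + 1)
    # Doubling the smaller tail flags deviations in either direction.
    return np.minimum(1.0, 2.0 * np.minimum(left, right))


def scan_nodes(pvalues, alpha_max=0.5):
    """Return (score, node_indices) for the highest-scoring subset of nodes.

    For a single input and a fixed threshold alpha, the best subset is every
    node with p <= alpha, so the Berk-Jones score N * KL(N_alpha / N, alpha)
    reduces to N_alpha * log(1 / alpha).
    """
    best_score, best_alpha = 0.0, None
    for alpha in np.unique(pvalues[pvalues <= alpha_max]):
        n_alpha = int(np.sum(pvalues <= alpha))
        score = n_alpha * np.log(1.0 / alpha)
        if score > best_score:
            best_score, best_alpha = score, alpha
    if best_alpha is None:
        return 0.0, np.array([], dtype=int)
    return float(best_score), np.flatnonzero(pvalues <= best_alpha)


# Synthetic stand-in for activations (assumption: real use would extract
# hidden states from the audited model instead).
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 50))
clean = rng.normal(size=50)
anomalous = clean.copy()
anomalous[:5] = 8.0      # planted shift in the positive direction
anomalous[5:10] = -8.0   # planted shift in the negative direction

clean_score, _ = scan_nodes(empirical_pvalues(reference, clean))
anomalous_score, nodes = scan_nodes(empirical_pvalues(reference, anomalous))
```

In this toy setup the anomalous input receives the higher score, and the returned node indices include the ten planted nodes. This is the sense in which a scan can point to the nodes that encode a pattern without ever seeing labeled false statements.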

Weakly Supervised Detection of Hallucinations in LLM Activations

With an increased focus on incorporating fairness in machine learning models,
it becomes imperative not only to assess and mitigate bias at each stage of the
machine learning pipeline but also to understand the downstream impacts of bias
across stages. Here we consider a general, but realistic, scenario in which a
predictive model is learned from (potentially biased) training data, and model
predictions are assessed post-hoc for fairness by some auditing method. We
provide a theoretical analysis of how a specific form of data bias,
differential sampling bias, propagates from the data stage to the prediction
stage. Unlike prior work, we evaluate the downstream impacts of data biases
quantitatively rather than qualitatively and prove theoretical guarantees for
detection. Under reasonable assumptions, we quantify how the amount of bias in
the model predictions varies as a function of the amount of differential
sampling bias in the data, and at what point this bias becomes provably
detectable by the auditor. Through experiments on two criminal justice datasets
-- the well-known COMPAS dataset and historical data from NYPD's stop-and-frisk
policy -- we demonstrate that the theoretical results hold in practice even
when our assumptions are relaxed.
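
As a rough illustration of how sampling bias can shift what a model sees, the sketch below assumes one simple form of differential sampling bias: positive examples from one subgroup are retained with probability `keep_prob`, while negatives are always retained. This definition, the function names, and the parameter values are assumptions made for illustration and are not taken from the paper. Under that assumption, a true positive rate p becomes an observed rate of s·p / (s·p + 1 − p), where s is the retention probability.

```python
import numpy as np


def observed_rate(p, keep_prob):
    """Closed-form positive rate after under-sampling positives.

    p:         true positive rate in the subgroup.
    keep_prob: probability that a positive example is retained (assumed model).
    """
    return keep_prob * p / (keep_prob * p + (1.0 - p))


def simulate_rate(p, keep_prob, n, seed=0):
    """Monte Carlo check of observed_rate on n synthetic individuals."""
    rng = np.random.default_rng(seed)
    y = rng.random(n) < p                         # true outcomes
    keep = ~y | (rng.random(n) < keep_prob)       # negatives always kept
    return float(y[keep].mean())


closed_form = observed_rate(0.3, 0.5)
simulated = simulate_rate(0.3, 0.5, n=200_000)
```

With no bias (`keep_prob = 1`) the observed rate equals the true rate. As `keep_prob` falls, the gap between the two grows, which is the kind of quantity an auditor would need to detect in downstream predictions.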

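This paper studies the general setting in which a predictive model is learned from (potentially biased) training data and is then assessed post hoc for fairness by an auditing method. It evaluates the downstream impacts of data bias quantitatively rather than qualitatively and proves theoretical guarantees for detection.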