Knowledge hallucination has raised widespread concerns about the security and reliability of deployed LLMs. Previous efforts to detect hallucinations have relied on logit-level uncertainty estimation or language-level self-consistency evaluation, where semantic information is inevitably lost during token decoding. We therefore propose to explore the dense semantic information retained within LLMs' \textbf{IN}ternal \textbf{S}tates for halluc\textbf{I}nation \textbf{DE}tection (\textbf{INSIDE}). In particular, we propose a simple yet effective \textbf{EigenScore} metric to better evaluate the self-consistency of responses; it exploits the eigenvalues of the responses' covariance matrix to measure semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, we explore a test-time feature clipping approach that truncates extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies on several popular LLMs and question-answering (QA) benchmarks show the effectiveness of our proposal.
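
As a rough illustration of the EigenScore idea, the following Python sketch scores K sampled responses by the average log-eigenvalue of their regularized covariance matrix. The function name `eigenscore`, the regularizer `alpha`, the per-embedding centering, and the use of NumPy are assumptions made for illustration; the abstract only states that the metric exploits the eigenvalues of the responses' covariance matrix.

```python
import numpy as np


def eigenscore(embeddings, alpha=1e-3):
    """Average log-eigenvalue of a regularized K x K covariance matrix.

    embeddings: array of shape (K, d), one sentence embedding per sampled
    response. `alpha` is an assumed small regularizer that keeps every
    eigenvalue strictly positive so the logarithm is defined.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    if z.ndim != 2:
        raise ValueError("embeddings must have shape (K, d)")
    k = z.shape[0]

    # Assumed centering convention: remove each embedding's mean over features.
    z = z - z.mean(axis=1, keepdims=True)

    # K x K covariance (Gram) matrix between the sampled responses.
    cov = z @ z.T

    # Symmetric matrix, so eigvalsh applies; eigenvalues are >= alpha in theory.
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(k))
    eigvals = np.maximum(eigvals, alpha)  # guard against round-off

    return float(np.mean(np.log(eigvals)))
```

Under this sketch, a higher score corresponds to more semantically diverse responses, which a self-consistency detector would read as a sign of possible hallucination; near-identical responses give a low score.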

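This work explores the dense semantic information retained in LLMs' internal states and proposes a method called INSIDE to better evaluate the self-consistency of responses. It also explores a test-time feature clipping method that truncates extreme activations in the internal states, which reduces overconfident generations and helps detect overconfident hallucinations. Extensive experiments and ablation studies on several popular LLMs and question-answering benchmarks demonstrate the effectiveness of the proposed method.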
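
Test-time feature clipping can be sketched in Python as follows. This is only a minimal illustration: the function name `clip_features`, the `reference` activation bank, and the percentile-based thresholds are assumptions, since the text above says only that extreme activations in the internal states are truncated.

```python
import numpy as np


def clip_features(hidden, reference, top_percent=0.2):
    """Truncate extreme activations to thresholds taken from a reference set.

    hidden: activations to clip (any shape).
    reference: hypothetical bank of previously observed activations used to
    estimate the thresholds.
    top_percent: assumed percentage of values treated as extreme on each tail.
    """
    hidden = np.asarray(hidden, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if not 0.0 <= top_percent < 50.0:
        raise ValueError("top_percent must be in [0, 50)")

    # Lower and upper thresholds from the tails of the reference distribution.
    h_min = np.percentile(reference, top_percent)
    h_max = np.percentile(reference, 100.0 - top_percent)

    # Values outside [h_min, h_max] are replaced by the nearest threshold.
    return np.clip(hidden, h_min, h_max)
```

In a real model, the clipped activations would be passed on to the remaining layers in place of the originals; which layer to clip and how to build the reference bank are design choices left open here.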

The internal states of LLMs retain the power of hallucination detection.