Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4\% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
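
As a minimal sketch of what "a direction that satisfies logical consistency properties" could look like in practice, the snippet below scores a linear probe on paired activations for a statement and its negation. The abstract does not spell out the objective, so everything here is an assumption made for illustration: the names `consistency_loss`, `theta`, `b`, `x_pos`, and `x_neg`, the sigmoid probe, and the two-term loss (one term asking the two probabilities to sum to one, one term discouraging the uninformative answer of 0.5 for both).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def consistency_loss(theta, b, x_pos, x_neg):
    """Hypothetical unsupervised objective for a linear probe (illustrative only).

    x_pos, x_neg: activations for a statement and its negation, shape (n, d).
    theta, b: probe direction (shape (d,)) and bias. No labels are used.
    """
    p_pos = sigmoid(x_pos @ theta + b)   # probe's P(true) for the statement
    p_neg = sigmoid(x_neg @ theta + b)   # probe's P(true) for the negation
    # Assumed consistency term: a statement and its negation should have
    # opposite truth values, so p_pos should be close to 1 - p_neg.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Assumed confidence term: penalise the degenerate solution
    # p_pos = p_neg = 0.5, which is consistent but carries no information.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

# Toy 1-D activations (made up): the pair sits on opposite sides of the origin.
x_pos = np.array([[10.0]])
x_neg = np.array([[-10.0]])
good = consistency_loss(np.array([1.0]), 0.0, x_pos, x_neg)  # separating direction
flat = consistency_loss(np.zeros(1), 0.0, x_pos, x_neg)      # uninformative probe
```

Under these assumptions, a direction that separates the pair drives the loss toward zero, while the all-zero probe is penalised by the confidence term alone. Note that such an objective cannot tell a direction from its negative, so deciding which side means "true" would need a separate step.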

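Summary: the paper proposes a method for finding latent knowledge directly in a language model's internal activations in a purely unsupervised way. By finding a direction in activation space that satisfies logical consistency properties, it can accurately answer yes-no questions given only unlabeled model activations. Across 6 models and 10 question-answering datasets, and despite using no supervision and no model outputs, the method recovers diverse knowledge represented in large language models and outperforms zero-shot accuracy by 4% on average. The results are an initial step toward discovering what language models know, as distinct from what they say, even when explicit ground-truth labels are unavailable. The method also cuts prompt sensitivity in half and maintains high accuracy even when models are prompted to generate incorrect answers.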

Discovering latent knowledge in language models without supervision