Large language models (LLMs) demonstrate significant knowledge through their
outputs, though it is often unclear whether false outputs are due to a lack of
knowledge or dishonesty. In this paper, we investigate instructed dishonesty,
wherein we explicitly prompt LLaMA-2-70b-chat to lie. We perform prompt
engineering to find which prompts best induce lying behavior, and then use
mechanistic interpretability approaches to localize where in the network this
behavior occurs. Using linear probing and activation patching, we localize five
layers that appear especially important for lying. Within these layers, we then
find just 46 attention heads on which causal interventions make the lying model
answer honestly instead. We show that these interventions
work robustly across many prompts and dataset splits. Overall, our work
contributes to a greater understanding of dishonesty in LLMs, with the aim of
preventing it.
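
As a rough illustration of activation patching, the sketch below uses a toy residual network rather than LLaMA-2-70b-chat. It caches each head's output on an "honest" input, substitutes one cached output at a time into the run on a "lying" input, and reports what fraction of the honest readout score is recovered. The network, the inputs, the readout direction, and all names (`run`, `patching_effects`, `x_honest`, `x_lying`) are assumptions made for illustration, not the implementation used in this work.

```python
import math
import random


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def run(x, weights, patch=None):
    """Run a toy residual stack, optionally overwriting head outputs.

    `patch` maps (layer, head) to a cached output vector that replaces
    the head's own output. Returns the final residual vector and a
    cache of every head output actually used in this run.
    """
    patch = patch or {}
    cache = {}
    resid = list(x)
    for layer, heads in enumerate(weights):
        update = [0.0] * len(resid)
        for head, w in enumerate(heads):
            out = [math.tanh(z) for z in matvec(w, resid)]
            out = patch.get((layer, head), out)  # substitute if patched
            cache[(layer, head)] = out
            update = [u + o for u, o in zip(update, out)]
        resid = [r + u for r, u in zip(resid, update)]
    return resid, cache


def patching_effects(x_clean, x_corrupt, weights, readout):
    """Patch each clean head output into the corrupt run, one at a time.

    Returns, per (layer, head), the fraction of the clean-vs-corrupt
    score gap that the single patch recovers.
    """
    clean_out, clean_cache = run(x_clean, weights)
    corrupt_out, _ = run(x_corrupt, weights)
    clean_score = dot(readout, clean_out)
    corrupt_score = dot(readout, corrupt_out)
    gap = clean_score - corrupt_score
    if gap == 0:
        raise ValueError("clean and corrupt runs give the same score")
    effects = {}
    for key, act in clean_cache.items():
        patched_out, _ = run(x_corrupt, weights, patch={key: act})
        effects[key] = (dot(readout, patched_out) - corrupt_score) / gap
    return effects


# Hypothetical toy setup: random weights stand in for a trained model.
rng = random.Random(0)
dim, n_layers, n_heads = 4, 3, 2
weights = [
    [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
     for _ in range(n_heads)]
    for _ in range(n_layers)
]
x_honest = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for an honest prompt
x_lying = [rng.gauss(0, 1) for _ in range(dim)]   # stand-in for a lying prompt
readout = [rng.gauss(0, 1) for _ in range(dim)]   # stand-in for a truthfulness score

effects = patching_effects(x_honest, x_lying, weights, readout)
for (layer, head), effect in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"layer {layer} head {head}: recovered fraction {effect:+.3f}")
```

In a real transformer, the same loop would typically be implemented with forward hooks on attention head outputs, and heads whose patches recover most of the gap would be candidates for intervention.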

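This paper studies instructed dishonesty in large language models, in which LLaMA-2-70b-chat is explicitly asked to lie. We use prompt engineering to find the prompts that best induce lying, and mechanistic interpretability methods to localize where in the network this behavior occurs. Within the five layers most important for lying, we identify 46 attention heads that allow targeted interventions making the lying model answer honestly. We show that these interventions are robust across multiple prompts and dataset splits. Overall, our work contributes to a deeper understanding of dishonesty in LLMs, with the aim of preventing it.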