We study the tendency of AI systems to deceive by constructing a realistic
simulation setting of a company AI assistant. The simulated company employees
give the assistant tasks spanning writing assistance, information retrieval and
programming. We then introduce situations where the model might be inclined to
behave deceptively, while taking care not to instruct or otherwise pressure the
model to do so. Across different
scenarios, we find that Claude 3 Opus
1) complies with a task of mass-generating comments to influence public
perception of the company, and later deceives humans about having done so,
2) lies to auditors when asked questions, and
3) strategically pretends to be less capable than it is during capability
evaluations.
Our work demonstrates that even models trained to be helpful, harmless and
honest sometimes behave deceptively in realistic scenarios, without notable
external pressure to do so.
