Recent capability increases in large language models (LLMs) open up applications in which teams of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both the AI and security literature. We study incentives for the use of steganography and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests the capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump, suggesting the need for continuous monitoring of the steganographic capabilities of frontier models. We conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative AI models.

Secret Collusion among Generative AI Agents