The emergence of pre-trained AI systems with powerful capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety: tasks can become too complicated for humans to judge directly. Irving et al. [2018] proposed a debate method to address this, with the goal of pitting the power of such AI models against each other until the problem of identifying (mis)alignment is broken down into a manageable subtask. While the promise of this approach is clear, the original framework assumed that the honest strategy is able to simulate deterministic AI systems for an exponential number of steps, which limits its applicability. In this paper, we show how to address these challenges by designing a new set of debate protocols in which the honest strategy can always succeed using a simulation of only a polynomial number of steps, while still being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.

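By designing a new set of debate protocols, this paper shows how to address a challenge in AI safety: the honest strategy can succeed in simulating pre-trained AI systems using only a polynomial number of steps, while also verifying the alignment of stochastic AI systems, even when the dishonest strategy is allowed an exponential number of simulation steps.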

Scalable AI Safety via Doubly-Efficient Debate