A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet real-world autonomous tests face several unique safety challenges, due both to the possibility of causing harm during a test and to the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. We propose a framework for conducting safe autonomous agent tests on the open internet: agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans. We design a basic safety monitor that is flexible enough to monitor existing LLM agents, and, using an adversarial simulated agent, we measure its ability to identify and stop unsafe situations. We then apply the safety monitor to a battery of real-world tests of AutoGPT, and we identify several limitations and challenges that the creation of safe in-the-wild tests will face as autonomous agents grow more capable.
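
To make the audit loop concrete, the following is a minimal sketch of one way such a monitor could be wired up; it is not the implementation evaluated in this work. The class `MonitoredTest`, the two thresholds, and the keyword-based `toy_scorer` are all hypothetical names and values chosen for illustration. A real monitor would presumably use a far richer scorer. The sketch shows only the three behaviors named above: scoring each action in context, stopping the test at the safety boundary, and ranking logged suspect actions for human review.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class MonitoredTest:
    """Audits agent actions and halts the test at the safety boundary.

    All names and default thresholds here are illustrative assumptions.
    """
    # Hypothetical scorer: (action, prior actions) -> risk in [0, 1].
    score_action: Callable[[str, List[str]], float]
    stop_threshold: float = 0.8  # assumed boundary: stop the test at or above this
    log_threshold: float = 0.3   # assumed: log for human review at or above this
    suspect_log: List[Tuple[float, str]] = field(default_factory=list)

    def run(self, actions: Iterable[str]) -> List[str]:
        """Return the actions that were allowed to execute before any halt."""
        context: List[str] = []
        for action in actions:
            risk = self.score_action(action, context)
            if risk >= self.log_threshold:
                self.suspect_log.append((risk, action))
            if risk >= self.stop_threshold:
                break  # stop the test before the unsafe action executes
            context.append(action)
        return context

    def ranked_suspects(self) -> List[Tuple[float, str]]:
        """Logged suspect actions, highest risk first."""
        return sorted(self.suspect_log, key=lambda item: item[0], reverse=True)


def toy_scorer(action: str, context: List[str]) -> float:
    """Toy stand-in for a real monitor: keyword rules only."""
    if "rm -rf" in action:
        return 1.0
    # Context sensitivity: posting data is riskier after credentials were read.
    if action.startswith("http_post") and any("read_file .env" in c for c in context):
        return 0.9
    if action.startswith("http_post"):
        return 0.4
    return 0.0
```

In this sketch the same `http_post` action receives a different score depending on what preceded it, which is the sense in which the monitor is context-sensitive. Breaking out of the loop before the action is appended models stopping the test rather than merely flagging it.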

Testing Language Model Agents Safely in the Wild