In this work we create agents that can perform well beyond a single, individual task, that exhibit much wider generalisation of behaviour to a massive, rich space of challenges. We define a universe of tasks within an environment domain and demonstrate the ability to train agents that are generally capable across this vast space and beyond. The environment is natively multi-agent, spanning the continuum of competitive, cooperative, and independent games, which are situated within procedurally generated physical 3D worlds. The resulting space is exceptionally diverse in terms of the challenges posed to agents, and as such, even measuring the learning progress of an agent is an open research problem. We propose an iterative notion of improvement between successive generations of agents, rather than seeking to maximise a singular objective, allowing us to quantify progress despite tasks being incomparable in terms of achievable rewards. We show that through constructing an open-ended learning process, which dynamically changes the training task distributions and training objectives such that the agent never stops learning, we achieve consistent learning of new behaviours. The resulting agent is able to score reward in every one of our humanly solvable evaluation levels, with behaviour generalising to many held-out points in the universe of tasks. Examples of this zero-shot generalisation include good performance on Hide and Seek, Capture the Flag, and Tag. Through analysis and hand-authored probe tasks we characterise the behaviour of our agent, and find interesting emergent heuristic behaviours such as trial-and-error experimentation, simple tool use, option switching, and cooperation. Finally, we demonstrate that the general capabilities of this agent could unlock larger scale transfer of behaviour through cheap finetuning.

本文介绍了一种基于多智能体、开放式学习的方法，其能够使得智能体在一种包含大量挑战、跨越多个任务、更广泛的行为通用化领域中表现出非凡的学习能力。通过在环境中建立一个任务的宇宙，我们的训练代理能够跨越更广泛的任务领域，这个领域自然多智能体，涉及合作竞争等多种类型的游戏，而这一领域的挑战对于智能体来说多种多样，因此，我们提出了一种迭代方法来改进代理的效果，而不是试图最大化一个单一目标。最终，我们证明了这种代理的通用能力，可以通过简单的微调实现更大规模的行为传递。

开放式学习导致通用能力的代理