Despite impressive successes, deep reinforcement learning (RL) systems still fall short of human performance on generalization to new tasks and environments that differ from their training. As a benchmark tailored for studying RL generalization, we introduce Avalon, a set of tasks in which embodied agents in highly diverse procedural 3D worlds must survive by navigating terrain, hunting or gathering food, and avoiding hazards. Avalon is unique among existing RL benchmarks in that the reward function, world dynamics, and action space are the same for every task, with tasks differentiated solely by altering the environment; its 20 tasks, ranging in complexity from eat and throw to hunt and navigate, each create worlds in which the agent must perform specific skills in order to survive. This setup enables investigations of generalization within tasks, between tasks, and to compositional tasks that require combining skills learned from previous tasks. Avalon includes a highly efficient simulator, a library of baselines, and a benchmark with scoring metrics evaluated against hundreds of hours of human performance, all of which are open-source and publicly available. We find that standard RL baselines make progress on most tasks but are still far from human performance, suggesting Avalon is challenging enough to advance the quest for generalizable RL.

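Avalon is a benchmark for studying generalization in reinforcement learning, introduced to help deep RL systems adapt to new tasks and to environments that differ from those seen in training. It is built on highly diverse procedurally generated 3D worlds in which embodied agents must survive by navigating terrain, hunting or gathering food, and avoiding hazards. The reward function, world dynamics, and action space are the same for every task; tasks differ only in the environment, and they range in complexity from simple skills such as eating and throwing to hunting and navigating. Avalon includes an efficient simulator, a library of baselines, and a benchmark with scoring metrics evaluated against human performance. Standard RL baselines make progress on most tasks but remain far from human performance, suggesting that Avalon is challenging enough to advance research on generalizable RL.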

Avalon: A Benchmark for RL Generalization Using Procedurally Generated Worlds