A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world models in conditional video generation, we explore the potential of image observation-based world models for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games to learn general-purpose representation and decision-making ability. Our method jointly optimizes a world-action model through a shared transformer backbone, which stabilizes temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q-value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales favorably with model capacity and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data, corresponding to about 4 trajectories per game, which demonstrates the superior generalization of JOWA. We will release code at https://github.com/CJReinforce/JOWA.

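This work addresses the challenge of building a generalist agent in offline reinforcement learning, in particular the limitations that arise when expert trajectories are scarce and when generalizing across diverse tasks. We propose JOWA, an offline model-based agent pretrained on multiple Atari games that learns general-purpose representation and decision-making ability. Experiments show that, using only 10% of the offline data, the model outperforms existing baselines and demonstrates efficient transfer and superior generalization on novel games.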

Scaling Offline Model-Based Reinforcement Learning via Jointly-Optimized World-Action Model Pretraining