Autonomous driving systems typically adopt a modular design that divides the full
stack into perception, prediction, planning, and control. Though
interpretable, such a modular design tends to introduce a substantial amount of
redundancy. Recently, multimodal large language models (MLLMs) and diffusion
techniques have demonstrated strong comprehension and generation
abilities. In this paper, we first introduce the concept of the
interleaved vision-action pair, which unifies the format of visual features and
control signals. Using these vision-action pairs, we construct a general world
model for autonomous driving, built on an MLLM and a diffusion model and termed
ADriver-I. It takes the vision-action pairs as inputs and autoregressively
predicts the control signal of the current frame. The generated control signals,
together with the historical vision-action pairs, then serve as conditions for
predicting future frames. With the predicted next frame, ADriver-I performs
further control signal prediction. Because this process can be repeated
indefinitely, ADriver-I achieves autonomous driving in the world created by itself.
Extensive experiments are conducted on nuScenes and our large-scale private
datasets. ADriver-I shows impressive performance compared to several
constructed baselines. We hope ADriver-I can provide new insights for
future autonomous driving and embodied intelligence.
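
The closed loop described above (predict a control signal, generate the next frame, repeat) can be illustrated with a minimal sketch. This is not ADriver-I's implementation: `VisionActionPair`, `rollout`, `toy_policy`, and `toy_generator` are hypothetical names introduced here, frames are plain lists of floats, and the two toy callables merely stand in for the MLLM and the diffusion model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[float]   # assumed stand-in for the visual features of one frame
Action = float        # assumed stand-in for a control signal


@dataclass
class VisionActionPair:
    """One interleaved (frame, control signal) pair."""
    frame: Frame
    action: Action


def rollout(
    history: Sequence[VisionActionPair],
    current_frame: Frame,
    predict_action: Callable[[Sequence[VisionActionPair], Frame], Action],
    predict_frame: Callable[[Sequence[VisionActionPair]], Frame],
    steps: int,
) -> Tuple[List[VisionActionPair], Frame]:
    """Alternate action prediction and frame generation for `steps` iterations."""
    pairs = list(history)
    frame = current_frame
    for _ in range(steps):
        # 1) Predict the control signal of the current frame from the history.
        action = predict_action(pairs, frame)
        # 2) Append the new vision-action pair to the history.
        pairs.append(VisionActionPair(frame, action))
        # 3) Generate the next frame conditioned on the updated history.
        frame = predict_frame(pairs)
    return pairs, frame


def toy_policy(pairs: Sequence[VisionActionPair], frame: Frame) -> Action:
    # Hypothetical placeholder for the MLLM: mean of the frame features.
    return sum(frame) / len(frame)


def toy_generator(pairs: Sequence[VisionActionPair]) -> Frame:
    # Hypothetical placeholder for the diffusion model: shift the last frame
    # by the last predicted action.
    last = pairs[-1]
    return [x + last.action for x in last.frame]


if __name__ == "__main__":
    pairs, next_frame = rollout([], [1.0, 3.0], toy_policy, toy_generator, steps=3)
    print([p.action for p in pairs], next_frame)
```

Passing the two predictors in as callables keeps the loop independent of any particular model, so the same structure holds whether the generated frame or a real camera frame is fed back in.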

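Building on multimodal large language models and diffusion techniques, we propose ADriver-I, a world model for autonomous driving. Based on interleaved vision-action pairs, the model predicts the control signal of the current frame and uses the historical vision-action pairs together with the generated control signal to predict future frames. By repeating this feedback loop indefinitely, ADriver-I achieves autonomous driving. Extensive experiments on nuScenes and large-scale private datasets demonstrate ADriver-I's strong performance, and we hope the model can offer new insights for future autonomous driving and embodied intelligence.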