We study how vision-language models (VLMs) trained on web-scale data can be
integrated into end-to-end driving systems to boost generalization and enable
interactivity with human users. While recent approaches adapt VLMs to driving
via single-round visual question answering (VQA), human drivers reason about
decisions in multiple steps. Starting from the localization of key objects,
humans estimate object interactions before taking actions. The key insight is
that with our proposed task, Graph VQA, where we model graph-structured
reasoning through perception, prediction and planning question-answer pairs, we
obtain a suitable proxy task to mimic the human reasoning process. We
instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose
a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA
and end-to-end driving. The experiments demonstrate that Graph VQA provides a
simple, principled framework for reasoning about a driving scene, and
DriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent
baseline performs end-to-end autonomous driving competitively in comparison to
state-of-the-art driving-specific architectures. Notably, its benefits are
pronounced when it is evaluated zero-shot on unseen objects or sensor
configurations. We hope this work can be the starting point to shed new light
on how to apply VLMs for autonomous driving. To facilitate future research, all
code, data, and models are available to the public.

我们研究了如何将在网络规模的数据上训练的视觉 - 语言模型（VLMs）整合到端到端驾驶系统中，以增强泛化能力，并实现与人类用户的互动。通过在感知、预测和规划等方面建立图结构推理的问答对模型，我们提出了 Graph VQA 任务，以模拟人类的推理过程。我们构建了基于 nuScenes 和 CARLA 的数据集（DriveLM-Data），并提出了一个基于 VLM 的基准方法（DriveLM-Agent），用于同时进行 Graph VQA 和端到端驾驶。实验证明 Graph VQA 为驾驶场景的推理提供了简单和有原则的框架，DriveLM-Data 为这一任务提供了具有挑战性的基准。我们的 DriveLM-Agent 基线在与最先进的专用驾驶架构相比的端到端自动驾驶方面表现出了竞争力。值得注意的是，当其在未见过的对象或传感器配置上进行零样本评估时，其效果更为显著。希望这项工作能为如何将 VLMs 应用于自动驾驶提供新的启示。为了促进未来的研究，我们将所有的代码、数据和模型公开提供。