Many existing motion prediction approaches rely on symbolic perception
outputs, such as bounding boxes, road graph information, and traffic lights, to
generate agent trajectories. This symbolic representation is a high-level
abstraction of the real world, which may render the motion prediction model
vulnerable to perception errors (e.g., failures in detecting open-vocabulary
obstacles) while missing salient information from the scene context (e.g., poor
road conditions). An alternative paradigm is end-to-end learning from raw
sensor data. However, this approach lacks interpretability and
requires significantly more training resources. In this work, we propose
tokenizing the visual world into a compact set of scene elements and then
leveraging pre-trained image foundation models and LiDAR neural networks to
encode all the scene elements in an open-vocabulary manner. The image
foundation model enables our scene tokens to encode the general knowledge of
the open world, while the LiDAR neural network encodes geometric information. Our
proposed representation can efficiently encode multi-frame, multi-modal
observations with a few hundred tokens and is compatible with most
transformer-based architectures. To evaluate our method, we have augmented the
Waymo Open Motion Dataset with camera embeddings. Experiments on the Waymo Open
Motion Dataset show that our approach leads to significant performance
improvements over the state of the art.

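By partitioning the visual world into scene elements and encoding all of them in an open-vocabulary manner with pre-trained image models and LiDAR neural networks, we propose a method that efficiently encodes multi-frame, multi-modal observations. Experiments on the Waymo Open Motion Dataset show that our method clearly outperforms existing approaches.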
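
To make the idea of a compact set of scene tokens concrete, the following is a minimal sketch of one possible tokenization step. It is an illustration under stated assumptions, not the pipeline used in this work: we assume that per-pixel features from some image encoder and an integer segment mask are already available, and we mean-pool the features inside each segment to obtain one token per scene element. The function name `pool_scene_tokens`, the input layout, and the choice of mean pooling are all hypothetical.

```python
from typing import Dict, List, Sequence


def pool_scene_tokens(
    features: Sequence[Sequence[Sequence[float]]],
    segment_ids: Sequence[Sequence[int]],
) -> Dict[int, List[float]]:
    """Mean-pool per-pixel features within each segment (hypothetical helper).

    features:    H x W x D nested sequence of per-pixel feature vectors,
                 assumed to come from some pre-trained image encoder.
    segment_ids: H x W nested sequence of integer segment labels,
                 assumed to come from some segmentation step.
    Returns a mapping from segment id to its D-dimensional token.
    """
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for feat_row, id_row in zip(features, segment_ids):
        for feat, seg in zip(feat_row, id_row):
            if seg not in sums:
                sums[seg] = [0.0] * len(feat)
                counts[seg] = 0
            # Accumulate the feature vector of this pixel into its segment.
            for i, value in enumerate(feat):
                sums[seg][i] += value
            counts[seg] += 1
    # Divide each accumulated sum by the pixel count to get the mean.
    return {seg: [s / counts[seg] for s in total] for seg, total in sums.items()}


# Toy 2 x 2 image with 2-dimensional features and two segments.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
segment_ids = [
    [0, 0],
    [1, 1],
]
tokens = pool_scene_tokens(features, segment_ids)
print(tokens)  # {0: [2.0, 3.0], 1: [6.0, 7.0]}
```

In this toy case a four-pixel image is reduced to two tokens. The same reduction applied to full-resolution frames is what would keep the token count in the low hundreds, although the actual segmentation and pooling choices may differ from this sketch.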