Jointly processing information from multiple sensors is crucial to achieving
accurate and robust perception for reliable autonomous driving systems.
However, current 3D perception research follows a modality-specific paradigm,
which incurs extra computational overhead and makes collaboration between
data from different sensors inefficient. In this paper, we present an efficient
multi-modal backbone for outdoor 3D perception named UniTR, which processes a
variety of modalities with unified modeling and shared parameters. Unlike
previous works, UniTR introduces a modality-agnostic transformer encoder that
handles these view-discrepant sensor data, enabling parallel modality-wise
representation learning and automatic cross-modal interaction without
additional fusion steps.
More importantly, to make full use of these complementary sensor types, we
present a novel multi-modal integration strategy that considers both
semantically rich 2D perspective relations and geometry-aware 3D sparse
neighborhood relations. UniTR is also a fundamentally task-agnostic backbone
that naturally
supports different 3D perception tasks. It sets a new state of the art on the
nuScenes benchmark, improving 3D object detection by +1.1 NDS and BEV map
segmentation by +12.0 mIoU, with lower inference latency. Code will be
available at this https URL.

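UniTR is an efficient multi-modal backbone that processes multi-sensor data to
enable accurate and reliable perception for autonomous driving systems. It
introduces a modality-agnostic transformer encoder that handles view-discrepant
sensor data, performing parallel modality-wise representation learning and
automatic cross-modal interaction without additional fusion steps. It also
proposes a novel multi-modal integration strategy that considers both
semantically rich 2D perspective relations and geometry-aware 3D sparse
neighborhood relations. On the nuScenes benchmark, UniTR improves 3D object
detection by +1.1 NDS and BEV map segmentation by +12.0 mIoU, with lower
inference latency.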
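
The idea of one encoder with shared parameters serving several modalities can
be illustrated with a toy sketch. The code below is not the UniTR
implementation; it is a minimal, standard-library-only illustration under
stated assumptions: a single-head self-attention block (the hypothetical
SharedAttention class) whose weights are reused first within each modality and
then over the concatenated camera and LiDAR tokens. All names, token counts,
and dimensions are made up for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class SharedAttention:
    """Hypothetical single-head self-attention block.

    One set of weights is reused for every modality (an assumption made
    here to illustrate parameter sharing, not the paper's actual layer).
    """

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

        self.dim = dim
        self.wq, self.wk, self.wv = init(), init(), init()

    def __call__(self, tokens):
        q = matmul(tokens, self.wq)
        k = matmul(tokens, self.wk)
        v = matmul(tokens, self.wv)
        k_t = [list(col) for col in zip(*k)]  # transpose of k
        scale = math.sqrt(self.dim)
        attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
        out = matmul(attn, v)
        # Residual connection keeps the output shape equal to the input shape.
        return [[t + o for t, o in zip(t_row, o_row)] for t_row, o_row in zip(tokens, out)]


def encode(block, camera_tokens, lidar_tokens):
    """Apply the same block within each modality, then across both."""
    cam = block(camera_tokens)   # intra-modal step, shared weights
    lid = block(lidar_tokens)    # intra-modal step, same weights
    fused = block(cam + lid)     # cross-modal step over all tokens
    return fused[:len(cam)], fused[len(cam):]


rng = random.Random(1)
block = SharedAttention(dim=4)
camera = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
lidar = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
cam_out, lidar_out = encode(block, camera, lidar)
print(len(cam_out), len(lidar_out))
```

In this sketch no separate fusion module exists: cross-modal interaction
happens only because tokens from both sensors attend to each other inside the
same block. A real backbone would add multi-head attention, normalization,
feed-forward layers, and an efficient token partitioning scheme.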