Recently, pure camera-based Bird's-Eye-View (BEV) perception has provided a feasible solution for economical autonomous driving. However, existing BEV-based multi-view 3D detectors generally transform all image features into BEV features, without considering that the large proportion of background information may submerge the object information. In this paper, we propose Semantic-Aware BEV Pooling (SA-BEVPool), which filters out background information according to the semantic segmentation of image features and transforms image features into semantic-aware BEV features. Accordingly, we propose BEV-Paste, an effective data augmentation strategy that closely matches the semantic-aware BEV features. In addition, we design a Multi-Scale Cross-Task (MSCT) head, which combines task-specific and cross-task information to predict depth distribution and semantic segmentation more accurately, further improving the quality of the semantic-aware BEV features. Finally, we integrate the above modules into a novel multi-view 3D object detection framework, namely SA-BEV. Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance. Code is available at https://github.com/mengtan00/SA-BEV.git.

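This paper introduces Semantic-Aware BEV Pooling (SA-BEVPool), a method that uses the semantic segmentation of image features to filter out background information and transforms image features into semantic-aware BEV features. It also proposes BEV-Paste, an effective data augmentation strategy matched to the semantic-aware BEV features. In addition, it designs a Multi-Scale Cross-Task (MSCT) head that combines task-specific and cross-task information to predict depth distribution and semantic segmentation more accurately, further improving the quality of the semantic-aware BEV features. Finally, these modules are integrated into a new multi-view 3D object detection framework, SA-BEV, which achieves state-of-the-art performance on the nuScenes dataset.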
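
To make the idea of filtering background before pooling more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each pixel carries a feature vector, a depth distribution, and a foreground probability, and that a precomputed table maps each (pixel, depth bin) pair to a flat BEV cell. The function name `semantic_aware_bev_pool`, the `sem_thresh` value, and the choice to weight features by the product of depth and semantic probabilities are all illustrative assumptions.

```python
import numpy as np

def semantic_aware_bev_pool(feat, depth_prob, sem_prob, bev_index,
                            num_cells, sem_thresh=0.25):
    """Hypothetical sketch of semantic-aware pooling into a flat BEV grid.

    feat:       (N, C) image features, one row per pixel
    depth_prob: (N, D) per-pixel depth distribution
    sem_prob:   (N,)   per-pixel foreground probability
    bev_index:  (N, D) flat BEV cell index per (pixel, depth bin), -1 if outside
    sem_thresh: assumed threshold below which a pixel counts as background
    """
    n, c = feat.shape

    # Drop pixels that the segmentation considers background.
    keep_pixel = sem_prob >= sem_thresh                      # (N,)

    # Assumed weighting: depth probability times foreground probability.
    weight = depth_prob * sem_prob[:, None]                  # (N, D)

    # A (pixel, depth bin) point is used only if the pixel is kept
    # and the point falls inside the BEV grid.
    valid = keep_pixel[:, None] & (bev_index >= 0)           # (N, D)
    pix, dep = np.nonzero(valid)

    # Scatter-add the weighted features into their BEV cells.
    bev = np.zeros((num_cells, c), dtype=feat.dtype)
    contrib = weight[pix, dep][:, None] * feat[pix]          # (K, C)
    np.add.at(bev, bev_index[pix, dep], contrib)
    return bev


# Toy usage: three pixels, two depth bins, two BEV cells.
feat = np.array([[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]])
depth_prob = np.array([[0.5, 0.5], [1.0, 0.0], [0.25, 0.75]])
sem_prob = np.array([1.0, 0.1, 0.8])   # the second pixel is background
bev_index = np.array([[0, 1], [0, 0], [1, -1]])
bev = semantic_aware_bev_pool(feat, depth_prob, sem_prob, bev_index, num_cells=2)
```

In this toy case the second pixel contributes nothing to the BEV grid because its foreground probability falls below the assumed threshold, which is the behavior the summary describes as filtering out background information.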

SA-BEV: Generating Semantic-Aware Bird's-Eye-View Features for Multi-View 3D Object Detection