Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D object detection. Compared with customized sparse convolution, the attention mechanism in Transformers is better suited to flexibly modeling long-range relationships and is easier to deploy in real-world applications. However, because point clouds are sparse, it is non-trivial to apply a standard Transformer to sparse points. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D object detection. To efficiently process sparse points in parallel, we propose Dynamic Sparse Window Attention, which partitions each window into a series of local regions according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow cross-set connections, we design a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, we also propose an attention-style 3D pooling module on sparse points, which is powerful and deployment-friendly without requiring any customized CUDA operations. Our model achieves state-of-the-art performance on the large-scale Waymo Open Dataset with remarkable gains. More importantly, DSVT can be easily deployed with TensorRT at real-time inference speed (27 Hz). Code will be available at \url{https://github.com/Haiyang-W/DSVT}.
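
The set partitioning described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `partition_window_sets`, the 2D voxel grid, the padding value `-1`, and the simple chunking rule are all assumptions made for illustration. Non-empty voxels are grouped by window, sorted inside each window along one of two axis orders (standing in for the two rotated configurations), and cut into fixed-size sets so that every set has the same shape.

```python
import numpy as np

def partition_window_sets(coords, window_size, set_size, axis_major="x"):
    """Group non-empty voxels into fixed-size sets, window by window.

    coords: (N, 2) integer array of voxel (x, y) indices (hypothetical 2D grid).
    Returns an (S, set_size) array of voxel indices padded with -1,
    plus a boolean mask that is True for real (non-padding) entries.
    """
    coords = np.asarray(coords)
    win = coords // window_size                 # window coordinate of each voxel
    n_win_y = win[:, 1].max() + 1
    win_id = win[:, 0] * n_win_y + win[:, 1]    # flatten window coordinate to one id
    local = coords % window_size                # position inside the window

    # Two sorting orders stand in for the two rotated partitioning configurations.
    if axis_major == "x":
        inner = local[:, 0] * window_size + local[:, 1]
    else:
        inner = local[:, 1] * window_size + local[:, 0]

    # np.lexsort uses the LAST key as the primary key: window first, then inner order.
    order = np.lexsort((inner, win_id))
    sorted_win = win_id[order]

    sets = []
    for w in np.unique(sorted_win):
        idx = order[sorted_win == w]            # voxels of this window, in sorted order
        n_sets = -(-len(idx) // set_size)       # ceil division: sparser windows get fewer sets
        padded = np.full(n_sets * set_size, -1, dtype=np.int64)
        padded[: len(idx)] = idx
        sets.append(padded.reshape(n_sets, set_size))

    set_idx = np.concatenate(sets, axis=0)
    return set_idx, set_idx >= 0

# Five voxels fall in one 4x4 window, one voxel falls in another.
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [5, 5]])
set_idx, mask = partition_window_sets(coords, window_size=4, set_size=2, axis_major="x")
print(set_idx.tolist())  # [[0, 1], [2, 3], [4, -1], [5, -1]]
```

Because every set has the same length, the sets can be stacked into one batch and passed through a single attention call, with the mask marking padded slots. Calling the function with `axis_major="y"` in the next layer regroups the same voxels differently, which is the role the rotated configurations play in connecting sets.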

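This paper presents the Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. To process sparse point clouds efficiently, we propose Dynamic Sparse Window Attention, which partitions each window into a series of local regions according to its sparsity and computes the features of all regions in a fully parallel manner. Our model achieves state-of-the-art performance across a broad range of 3D perception tasks and can be easily deployed with TensorRT at real-time inference speed (27 Hz).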

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets