Recent Transformer-based 3D object detectors learn point cloud features from either point- or voxel-based representations. However, the former requires time-consuming sampling, while the latter introduces quantization errors. In this paper, we present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD) that takes advantage of both representations. Specifically, we first use voxel-based sparse convolutions for efficient feature encoding. Then, we propose a Point-Voxel Transformer (PVT) module that obtains long-range contexts cheaply from voxels while attaining accurate positions from points. The key to associating the two representations is our input-dependent Query Initialization module, which efficiently generates reference points and content queries. PVT then adaptively fuses long-range contextual and local geometric information around the reference points into the content queries. Further, to quickly find the neighboring points of the reference points, we design the Virtual Range Image module, which generalizes the native range image to multi-sensor and multi-frame settings. Experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method. Code will be available at https://github.com/Nightmare-n/PVT-SSD.
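
The range-image idea behind the fast neighbor search can be illustrated with a minimal NumPy sketch. It is not the paper's Virtual Range Image module, which the abstract describes as generalizing to multiple sensors and frames; it covers only the plain single-sensor, single-frame case. The function names, image size, field-of-view values, and window size are assumptions chosen for illustration.

```python
import numpy as np


def project(points, height, width, fov_up_deg, fov_down_deg):
    """Map (N, 3) points to range-image (row, col) pixels; also return ranges."""
    rng = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(points[:, 1], points[:, 0])
    sin_incl = np.clip(points[:, 2] / np.maximum(rng, 1e-8), -1.0, 1.0)
    inclination = np.arcsin(sin_incl)
    fov_up, fov_down = np.deg2rad(fov_up_deg), np.deg2rad(fov_down_deg)
    # Columns wrap around the full 360-degree azimuth.
    col = np.floor((azimuth + np.pi) / (2 * np.pi) * width).astype(np.int64) % width
    # Row 0 is the top of the vertical field of view.
    row = np.floor((fov_up - inclination) / (fov_up - fov_down) * height).astype(np.int64)
    return row, col, rng


def build_index_image(points, height=32, width=360, fov_up_deg=15.0, fov_down_deg=-15.0):
    """Store, per pixel, the index of the nearest point (-1 where empty)."""
    row, col, rng = project(points, height, width, fov_up_deg, fov_down_deg)
    idx = np.flatnonzero((row >= 0) & (row < height))  # drop points outside the FOV
    flat = row[idx] * width + col[idx]
    order = np.lexsort((rng[idx], flat))  # sort by pixel, then by range
    _, first = np.unique(flat[order], return_index=True)  # nearest point per pixel
    image = np.full(height * width, -1, dtype=np.int64)
    image[flat[order][first]] = idx[order][first]
    return image.reshape(height, width)


def query_neighbors(points, index_image, ref_point, radius, window=2,
                    fov_up_deg=15.0, fov_down_deg=-15.0):
    """Return indices of stored points within `radius` of `ref_point`.

    Only the (2 * window + 1)^2 pixels around the projected reference point
    are inspected, instead of scanning every point in the cloud.
    """
    ref_point = np.asarray(ref_point, dtype=float)
    height, width = index_image.shape
    row, col, _ = project(ref_point[None, :], height, width, fov_up_deg, fov_down_deg)
    rows = np.arange(row[0] - window, row[0] + window + 1)
    rows = rows[(rows >= 0) & (rows < height)]
    cols = np.arange(col[0] - window, col[0] + window + 1) % width  # azimuth wrap
    candidates = index_image[np.ix_(rows, cols)].ravel()
    candidates = candidates[candidates >= 0]
    dist = np.linalg.norm(points[candidates] - ref_point, axis=1)
    return candidates[dist <= radius]
```

The lookup cost depends on the window size rather than on the number of points, which is the property that makes range images attractive for neighbor queries. The sketch also shows a limitation of a single range image: each pixel keeps only one point, so a farther point that falls into an occupied pixel is dropped.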

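This paper presents a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD). It uses voxel-based sparse convolutions for efficient feature encoding, obtains long-range context cheaply from voxels while obtaining accurate positions from points, and associates the two representations through an input-dependent Query Initialization module. In addition, a Virtual Range Image module lets the method quickly find the neighboring points of the reference points. Experiments on several autonomous driving benchmarks demonstrate the method's effectiveness and efficiency.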

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer