Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly in scenes with heavy occlusion and severe perspective distortion. However, conventional MVC methods rely on hand-crafted heuristic features and require identical camera layouts, which limits their applicability and scalability in real-world scenarios. In this work, we propose a concise 3D MVC framework called \textbf{CountFormer} that elevates multi-view image-level features to a scene-level volume representation and estimates the 3D density map from the volume features. Through a camera encoding strategy, CountFormer embeds camera parameters into the volume query and the image-level features, enabling it to handle camera layouts that differ significantly from one another. Furthermore, we introduce a feature lifting module built on the attention mechanism that transforms image-level features into a 3D volume representation for each camera view. A multi-view volume aggregation module then uses attention to aggregate the per-view volumes into a comprehensive scene-level volume representation, allowing CountFormer to handle images captured by arbitrary dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across various widely used datasets, demonstrating that it is better suited to real-world deployment than conventional MVC frameworks.

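Summary: The paper proposes CountFormer, a concise 3D multi-view counting (MVC) framework that lifts multi-view image-level features to a scene-level volume representation and estimates a 3D density map from the volume features. A camera encoding strategy embeds camera parameters into the volume query and the image-level features, so the model can handle camera layouts that differ significantly. An attention-based feature lifting module converts image-level features into a 3D volume representation for each camera view, and a multi-view volume aggregation module then aggregates the per-view volumes with attention into a comprehensive scene-level volume representation. As a result, CountFormer can process images captured under arbitrary dynamic camera layouts. The method outperforms existing approaches on several widely used datasets and appears better suited to real-world use than conventional MVC frameworks.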
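
To make the aggregation idea concrete, the following is a minimal sketch of attention-weighted fusion of per-view volumes. It is not the paper's implementation: the function names `softmax` and `aggregate_volumes`, the scaled dot-product scoring, and the nested-list data layout are all assumptions chosen for illustration. It only shows why such a fusion step does not depend on a fixed number of cameras.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_volumes(view_volumes, volume_query):
    """Fuse per-view volumes into one scene-level volume (illustrative only).

    view_volumes: list over views; each view is a list over voxels of
                  feature vectors (lists of floats).
    volume_query: list over voxels of query vectors with the same dimension.
    The scoring rule (scaled dot product) is an assumption, not the paper's.
    """
    dim = len(volume_query[0])
    scene = []
    for n, query in enumerate(volume_query):
        # One score per view for this voxel.
        scores = [
            sum(q * f for q, f in zip(query, volume[n])) / math.sqrt(dim)
            for volume in view_volumes
        ]
        weights = softmax(scores)
        # Weighted sum of the views' features at this voxel.
        fused = [
            sum(w * volume[n][c] for w, volume in zip(weights, view_volumes))
            for c in range(dim)
        ]
        scene.append(fused)
    return scene
```

Because the softmax is taken over however many views are supplied, the same function accepts two cameras or twenty without any change, which mirrors the abstract's claim about arbitrary camera layouts at a very coarse level.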

CountFormer: Multi-View Crowd Counting Transformer