Understanding the real world through point cloud video is a crucial aspect of
robotics and autonomous driving systems. However, prevailing methods for 4D
point cloud recognition have limitations due to sensor resolution, which leads
to a lack of detailed information. Recent advances have shown that
Vision-Language Models (VLMs) pre-trained on web-scale text-image datasets can
learn fine-grained visual concepts that can be transferred to various
downstream tasks. However, effectively integrating VLMs into the domain of 4D
point clouds remains an unresolved problem. In this work, we propose the
Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from
visual-text pre-trained models to a 4D point cloud network. Our approach
aligns the 4D encoder's representation with that of a VLM, so that the encoder
learns the shared visual and text space the VLM acquired by training on
large-scale image-text pairs. By transferring the VLM's knowledge to the 4D
encoder and combining the encoder with the VLM, VG4D achieves improved
recognition performance. To enhance the 4D encoder,
we modernize the classic dynamic point cloud backbone and propose an improved
version of PSTNet, im-PSTNet, which can efficiently model point cloud videos.
Experiments demonstrate that our method achieves state-of-the-art performance
for action recognition on both the NTU RGB+D 60 dataset and the NTU RGB+D 120
dataset. Code is available at https://github.com/Shark0-0/VG4D.
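
The alignment and combination steps described above can be illustrated with a minimal sketch. The abstract does not specify the loss or the fusion rule, so everything here is an assumption made for illustration: the cosine-similarity cross-entropy loss, the `temperature` value, the weighted score average, and all function names are hypothetical and are not taken from the VG4D code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine_logits(feature, class_text_embeddings, temperature=0.07):
    """Temperature-scaled cosine similarity between one 4D feature and each class text embedding."""
    f = l2_normalize(feature)
    return [
        sum(a * b for a, b in zip(f, l2_normalize(t))) / temperature
        for t in class_text_embeddings
    ]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def alignment_loss(feature, class_text_embeddings, label, temperature=0.07):
    """Assumed alignment objective: cross-entropy over similarities to the text embeddings."""
    probs = softmax(cosine_logits(feature, class_text_embeddings, temperature))
    return -math.log(probs[label])

def fuse_scores(scores_4d, scores_vlm, weight_4d=0.5):
    """Assumed combination rule: weighted average of 4D-encoder and VLM class scores."""
    return [weight_4d * a + (1.0 - weight_4d) * b for a, b in zip(scores_4d, scores_vlm)]

# Toy usage with 2-dimensional embeddings for two action classes.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
point_feature = [1.0, 0.0]
loss_correct = alignment_loss(point_feature, text_embeddings, label=0)
loss_wrong = alignment_loss(point_feature, text_embeddings, label=1)
fused = fuse_scores([0.8, 0.2], [0.4, 0.6])
```

In this toy case the loss is small when the 4D feature points toward the text embedding of its own class and large otherwise, which is the behavior an alignment objective of this kind is meant to encourage.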

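With the Vision-Language Models Goes 4D (VG4D) framework, we transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network, achieving improved recognition performance. We also propose im-PSTNet, an improved version of PSTNet, to enhance the 4D encoder, and experiments show that our method achieves state-of-the-art performance in action recognition.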

VG4D: Vision-Language Model Goes 4D Video Recognition

In this technical report, we present our findings from research conducted on
the Human-Object Interaction 4D (HOI4D) dataset for the egocentric action
segmentation task. Because point cloud video understanding is a relatively new
research area, its methods may be weak at temporal modeling, especially for
long point cloud videos (e.g., 150 frames). In contrast, traditional video
understanding methods are well developed, and their effectiveness in temporal
modeling has been widely verified on many large-scale video datasets.
Therefore, we convert point
cloud videos into depth videos and employ traditional video modeling methods to
improve 4D action segmentation. Ensembling depth and point cloud video methods
significantly improves accuracy. The proposed method, named Mixture of Depth
and Point cloud video experts (DPMix), achieved first place in the 4D Action
Segmentation Track of the HOI4D Challenge 2023.
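
The two ideas above, rendering a point cloud frame as a depth image and ensembling per-frame predictions from several experts, can be sketched as follows. The report does not state how either step is implemented, so this is only an illustration under assumptions: the pinhole projection with intrinsics `fx`, `fy`, `cx`, `cy`, the nearest-point rule for pixel collisions, the plain score averaging, and all function names are hypothetical.

```python
def point_cloud_to_depth(points, width, height, fx, fy, cx, cy):
    """Project camera-frame 3D points (x, y, z) into a depth image.

    Assumes a pinhole camera; pixels that receive no point keep depth 0.0.
    """
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # points behind the camera cannot be projected
        u = int(round(fx * x / z + cx))  # column index
        v = int(round(fy * y / z + cy))  # row index
        if 0 <= u < width and 0 <= v < height:
            # Keep the nearest point when several land on the same pixel.
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth

def ensemble_frame_labels(expert_scores):
    """Average per-frame class scores over experts and pick the best class per frame.

    expert_scores[e][t][c] is the score expert e gives class c at frame t.
    """
    num_experts = len(expert_scores)
    num_frames = len(expert_scores[0])
    labels = []
    for t in range(num_frames):
        num_classes = len(expert_scores[0][t])
        avg = [
            sum(expert_scores[e][t][c] for e in range(num_experts)) / num_experts
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=lambda c: avg[c]))
    return labels

# Toy usage: one tiny frame, then two experts scoring two frames over two classes.
frame = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
depth_image = point_cloud_to_depth(frame, width=4, height=4, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
depth_expert = [[0.9, 0.1], [0.4, 0.6]]
point_expert = [[0.6, 0.4], [0.1, 0.9]]
frame_labels = ensemble_frame_labels([depth_expert, point_expert])
```

Averaging scores is only one possible ensembling rule; weighted averaging or voting would fit the same interface.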

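By converting point cloud videos into depth videos and applying traditional video modeling methods, the proposed Mixture of Depth and Point cloud video experts (DPMix) method significantly improves 4D action segmentation accuracy and ranked first in the 4D Action Segmentation Track of the HOI4D Challenge 2023.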