Referring video object segmentation (RVOS) relies on natural language
expressions to segment target objects in video, emphasizing modeling dense
text-video relations. Current RVOS methods typically use independently
pre-trained vision and language models as backbones, resulting in a significant
domain gap between video and text. In cross-modal feature interaction, text
features are used only for query initialization, so important information in
the text is not fully exploited. In this work, we propose using frozen
pre-trained vision-language models (VLM) as backbones, with a specific emphasis
on enhancing cross-modal feature interaction. First, we use a frozen
convolutional CLIP backbone to generate feature-aligned vision and text
features, alleviating the issue of domain gap and reducing training costs.
Secondly, we add more cross-modal feature fusion in the pipeline to enhance the
utilization of multi-modal information. Furthermore, we propose a novel video
query initialization method to generate higher quality video queries. Without
bells and whistles, our method achieved 51.5 J&F on the MeViS test set and
ranked 3rd in the MeViS Track of the CVPR 2024 PVUW workshop: Motion Expression
guided Video Segmentation.
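
As a rough illustration of what "cross-modal feature fusion" can look like, the sketch below lets each video token attend over the text tokens and adds the resulting text context back residually. It is a minimal toy under stated assumptions, not the authors' pipeline: the helper name `fuse_with_text`, the toy feature values, and the omission of learned projection layers are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_with_text(video_tokens, text_tokens):
    """Let every video token attend over all text tokens (hypothetical helper).

    Assumes both token lists share one feature dimension, as they would when a
    single VLM produces aligned features. Learned projections are omitted.
    """
    dim = len(text_tokens[0])
    fused = []
    for v in video_tokens:
        # Scaled dot-product similarity between this video token and each word.
        weights = softmax([sum(a * b for a, b in zip(v, t)) / math.sqrt(dim)
                           for t in text_tokens])
        # Text context: attention-weighted average of the text tokens.
        context = [sum(w * t[d] for w, t in zip(weights, text_tokens))
                   for d in range(dim)]
        # Residual connection keeps the original visual information.
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


video_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy visual features
text_tokens = [[1.0, 0.0], [0.0, 2.0]]               # toy per-word features
fused_video = fuse_with_text(video_tokens, text_tokens)
```

Because the text tokens act as both keys and values here, every video token receives a summary of the whole expression rather than only a query-side initialization.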

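Proposes using a pre-trained vision-language model as the backbone, with an emphasis on enhancing cross-modal feature interaction, and reports notable improvements in video object segmentation.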

3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation

Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at this https URL.
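
The third use of captions (output score) can be pictured as a late fusion of two similarity scores. The sketch below is a minimal illustration under assumptions, not the Cap4Video implementation: the function `rank_videos`, the mixing weight `alpha`, and the use of a plain weighted sum of cosine similarities are hypothetical choices made here for clarity.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_videos(query, video_embs, caption_embs, alpha=0.5):
    """Rank videos by a weighted sum of two matching branches.

    query        : embedding of the text query
    video_embs   : one embedding per video (Query-Video branch)
    caption_embs : one embedding per generated caption (Query-Caption branch)
    alpha        : hypothetical weight of the caption branch, in [0, 1]
    """
    scored = []
    for idx, (video, caption) in enumerate(zip(video_embs, caption_embs)):
        score = ((1.0 - alpha) * cosine(query, video)
                 + alpha * cosine(query, caption))
        scored.append((score, idx))
    # Highest combined score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Two videos that look identical to the query; only the captions differ.
query = [1.0, 0.0]
video_embs = [[1.0, 1.0], [1.0, 1.0]]
caption_embs = [[0.0, 1.0], [1.0, 0.0]]
ranking = rank_videos(query, video_embs, caption_embs)
```

In this toy case the Query-Video branch cannot separate the two candidates, so the caption branch decides the order, which is the complementary effect the abstract describes.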

This paper proposes Cap4Video, a text-video retrieval method based on zero-shot video captioning and cross-modal feature interaction. It uses captions generated from the videos in three ways (input data, intermediate feature interaction, and output score) to enhance video representations and retrieval. Experiments show state-of-the-art performance on four standard benchmarks: MSR-VTT, VATEX, MSVD, and DiDeMo.

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

Object detection through either RGB images or LiDAR point clouds has been
extensively explored in autonomous driving. However, it remains challenging to
make these two data sources complementary and beneficial to each other. In this
paper, we propose AutoAlign, an automatic feature fusion strategy for
3D object detection. Instead of establishing a deterministic correspondence with
a camera projection matrix, we model the mapping relationship between the image
and point clouds with a learnable alignment map. This map enables our model to
automate the alignment of non-homogeneous features in a dynamic and data-driven
manner. Specifically, a cross-attention feature alignment module is devised to
adaptively aggregate pixel-level image features for each voxel. To
enhance the semantic consistency during feature alignment, we also design a
self-supervised cross-modal feature interaction module, through which the model
can learn feature aggregation with instance-level feature guidance.
Extensive experimental results show that our approach can lead to 2.3 mAP and
7.0 mAP improvements on the KITTI and nuScenes datasets, respectively. Notably,
our best model reaches 70.9 NDS on the nuScenes testing leaderboard, achieving
competitive performance among state-of-the-art methods.
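
To make the contrast between a deterministic projection and a learnable alignment map concrete, the sketch below compares a hard lookup (each voxel copies one projected pixel) with a soft, attention-style map (each voxel weights all pixels). It is a toy under assumptions rather than the AutoAlign module: `hard_align`, `soft_align`, the shared feature dimension, and the absence of learned query/key/value projections are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_align(pixel_index_per_voxel, pixel_feats):
    """Deterministic baseline: each voxel copies the one pixel it projects to."""
    return [list(pixel_feats[i]) for i in pixel_index_per_voxel]


def soft_align(voxel_feats, pixel_feats):
    """Attention-style alignment: each voxel weights every pixel.

    Returns (alignment_map, aggregated), where alignment_map[v][p] is the
    weight voxel v puts on pixel p and each row sums to 1.
    """
    dim = len(pixel_feats[0])
    alignment_map, aggregated = [], []
    for voxel in voxel_feats:
        # Scaled dot-product similarity between the voxel and each pixel.
        row = softmax([sum(a * b for a, b in zip(voxel, pixel)) / math.sqrt(dim)
                       for pixel in pixel_feats])
        alignment_map.append(row)
        # Weighted average of pixel features for this voxel.
        aggregated.append([sum(w * pixel[d] for w, pixel in zip(row, pixel_feats))
                           for d in range(dim)])
    return alignment_map, aggregated


pixel_feats = [[1.0, 0.0], [0.0, 1.0]]  # toy image features
voxel_feats = [[4.0, 0.0]]              # toy voxel feature, similar to pixel 0
alignment_map, aggregated = soft_align(voxel_feats, pixel_feats)
projected = hard_align([1], pixel_feats)  # fixed projection picks pixel 1
```

Here the fixed projection returns pixel 1 regardless of content, while the soft map places most of its weight on pixel 0 because its features match the voxel better.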

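This study proposes AutoAlign, an automatic feature fusion strategy that combines a learnable alignment map with a cross-attention feature alignment module and a self-supervised cross-modal feature interaction module to jointly process image and point cloud data. Experiments show that the method performs well on both the KITTI and nuScenes datasets.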