While pre-training large-scale video-language models (VLMs) has shown
remarkable potential for various downstream video-language tasks, existing VLMs
can still suffer from certain common limitations, e.g., coarse-grained
cross-modal alignment, under-modeling of temporal dynamics, and a detached
video-language view. In this work, we aim to enhance VLMs with a fine-grained
structural spatio-temporal alignment learning method (namely Finsta). First of
all, we represent the input texts and videos with fine-grained scene graph (SG)
structures, both of which are further unified into a holistic SG (HSG) for
bridging two modalities. Then, an SG-based framework is built, where the
textual SG (TSG) is encoded with a graph Transformer, while the video dynamic
SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for
spatial and temporal feature propagation. A spatial-temporal Gaussian
differential graph Transformer is further devised to strengthen the sense of
the changes in objects across spatial and temporal dimensions. Next, based on
the fine-grained structural features of TSG and DSG, we perform object-centered
spatial alignment and predicate-centered temporal alignment, respectively,
enhancing video-language grounding both spatially and temporally.
We design our method as a plug-and-play system, which can be integrated into
existing well-trained VLMs for further representation augmentation, without
training from scratch or relying on SG annotations in downstream applications.
On 6 representative VL modeling tasks over 12 datasets in both standard and
long-form video scenarios, Finsta consistently improves 13 existing
strong-performing VLMs and significantly advances the current state-of-the-art
end-task performance in both the fine-tuning and zero-shot settings.
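
To make the graph structures named in the abstract more concrete, the following is a minimal Python sketch of a textual SG, a dynamic (per-frame) video SG, and their unification into a holistic SG. It is an illustration under stated assumptions, not the authors' implementation: the names `Node`, `SceneGraph`, `build_text_sg`, `build_dynamic_sg`, and `unify` are hypothetical, triplets are assumed to be already parsed, temporal edges are assumed to link same-labelled objects in consecutive frames, and cross-modal edges are created by exact label match as a stand-in for whatever matching the method actually uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    modality: str    # "text" or "video"
    label: str       # e.g. "man", "ride"
    kind: str        # "object" or "predicate"
    frame: int = -1  # frame index for video nodes; -1 for text nodes


@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (source, target, edge_type)

    def add_triplet(self, subj, pred, obj):
        # Store subject -> predicate -> object as two directed edges.
        for node in (subj, pred, obj):
            if node not in self.nodes:
                self.nodes.append(node)
        self.edges.append((subj, pred, "spatial"))
        self.edges.append((pred, obj, "spatial"))


def build_text_sg(triplets):
    """Textual SG from (subject, predicate, object) label triplets."""
    graph = SceneGraph()
    for s, p, o in triplets:
        graph.add_triplet(Node("text", s, "object"),
                          Node("text", p, "predicate"),
                          Node("text", o, "object"))
    return graph


def build_dynamic_sg(frames):
    """Dynamic SG: one list of triplets per frame, plus temporal edges.

    Assumption: an object is tracked across frames by its label only.
    """
    graph = SceneGraph()
    previous = {}
    for t, triplets in enumerate(frames):
        current = {}
        for s, p, o in triplets:
            subj = Node("video", s, "object", t)
            pred = Node("video", p, "predicate", t)
            obj = Node("video", o, "object", t)
            graph.add_triplet(subj, pred, obj)
            current[s] = subj
            current[o] = obj
        for label, node in current.items():
            if label in previous:
                graph.edges.append((previous[label], node, "temporal"))
        previous = current
    return graph


def unify(tsg, dsg):
    """Holistic SG: both graphs plus cross-modal edges (label match, assumed)."""
    hsg = SceneGraph(nodes=tsg.nodes + dsg.nodes, edges=tsg.edges + dsg.edges)
    for text_node in tsg.nodes:
        for video_node in dsg.nodes:
            if (text_node.kind == video_node.kind
                    and text_node.label == video_node.label):
                hsg.edges.append((text_node, video_node, "cross-modal"))
    return hsg


tsg = build_text_sg([("man", "ride", "horse")])
dsg = build_dynamic_sg([[("man", "ride", "horse")],
                        [("man", "ride", "horse")]])
hsg = unify(tsg, dsg)
```

In this toy setup, object nodes linked across modalities would be the natural candidates for object-centered spatial alignment, and linked predicate nodes for predicate-centered temporal alignment; the feature encoders and alignment losses themselves are outside the scope of the sketch.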

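Through a fine-grained structural spatio-temporal alignment learning method (Finsta), the input texts and videos are represented as fine-grained scene graph (SG) structures and then unified into a holistic SG (HSG), strengthening semantic and temporal video-language alignment and improving the performance of large-scale video-language models (VLMs) on various downstream tasks.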

Strengthening Structural Spatio-Temporal Alignment of Video-Language Representations

Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment

The recent and increasing interest in video-language research has driven the
development of large-scale datasets that enable data-intensive machine learning
techniques. In comparison, limited effort has been made at assessing the
fitness of these datasets for the video-language grounding task. Recent works
have begun to discover significant limitations in these datasets, suggesting
that state-of-the-art techniques commonly overfit to hidden dataset biases. In
this work, we present MAD (Movie Audio Descriptions), a novel benchmark that
departs from the paradigm of augmenting existing video datasets with text
annotations and focuses on crawling and aligning available audio descriptions
of mainstream movies. MAD contains over 384,000 natural language sentences
grounded in over 1,200 hours of videos and exhibits a significant reduction in
the currently diagnosed biases for video-language grounding datasets. MAD's
collection strategy enables a novel and more challenging version of
video-language grounding, where short temporal moments (typically seconds long)
must be accurately grounded in diverse long-form videos that can last up to
three hours. We have released MAD's data and baseline code at
this https URL
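
The abstract does not state how grounding accuracy is scored. As a hedged illustration of what "accurately grounded" can mean for a seconds-long moment in an hours-long video, the sketch below uses temporal intersection-over-union (IoU) and a Recall@K check, a metric commonly used for temporal grounding. The function names `temporal_iou` and `recall_at_k` and the example timestamps are assumptions for illustration, not taken from MAD's released code.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start_seconds, end_seconds)."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_k(ranked_predictions, gt, k, iou_threshold):
    """True if any of the top-k ranked intervals overlaps gt enough."""
    return any(temporal_iou(p, gt) >= iou_threshold
               for p in ranked_predictions[:k])


# Hypothetical example: a 4-second moment inside a long movie.
ground_truth = (3612.0, 3616.0)
predictions = [(3600.0, 3604.0), (3611.0, 3615.0)]  # ranked by model score
hit_at_1 = recall_at_k(predictions, ground_truth, k=1, iou_threshold=0.5)
hit_at_2 = recall_at_k(predictions, ground_truth, k=2, iou_threshold=0.5)
```

Because the target moment is tiny relative to the video, a prediction that is off by only a few seconds already drops to zero IoU, which is one way to see why the long-form setting is harder than grounding in short clips.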

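This paper presents MAD, a benchmark built by crawling and aligning available audio descriptions of mainstream movies. It contains over 384,000 natural language sentences, shows a significant reduction in the biases present in video-language grounding datasets, and enables short temporal moments to be accurately grounded in videos up to three hours long.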

MAD: A Movie Audio Description Dataset for Video-Language Grounding

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

In this paper, we explore a novel task named visual Relation Grounding in
Videos (vRGV). The task aims at spatio-temporally localizing the given
relations in the form of subject-predicate-object in the videos, so as to
provide supportive visual facts for other high-level video-language tasks
(e.g., video-language grounding and video question answering). The challenges
in this task include, but are not limited to: (1) both the subject and object are
required to be spatio-temporally localized to ground a query relation; (2) the
temporal dynamic nature of visual relations in videos is difficult to capture;
and (3) the grounding should be achieved without any direct supervision in
space and time. To ground the relations, we tackle the challenges by
collaboratively optimizing two sequences of regions over a constructed
hierarchical spatio-temporal region graph through relation attending and
reconstruction, in which we further propose a message passing mechanism by
spatial attention shifting between visual entities. Experimental results
demonstrate that our model not only outperforms baseline approaches
significantly, but also produces visually meaningful facts to support visual
grounding. (Code is available at this https URL).
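
To illustrate what "a sequence of regions" for a subject or object can look like, the sketch below links one candidate box per frame into a trajectory with a Viterbi-style dynamic program that trades off per-frame relevance scores against box overlap between consecutive frames. This is a simplified stand-in, not the paper's method: the function names `box_iou` and `link_regions`, the `smoothness` weight, and the assumption that per-frame scores are already available are all hypothetical, and the subject and object sequences would be linked independently here, whereas the abstract describes optimizing them collaboratively.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def link_regions(frames, smoothness=1.0):
    """Pick one candidate index per frame, maximizing score plus overlap.

    frames: list over time of candidate lists [(box, score), ...].
    Returns the chosen candidate index for every frame.
    """
    best = [score for _, score in frames[0]]
    backpointers = []
    for t in range(1, len(frames)):
        new_best, pointers = [], []
        for box, score in frames[t]:
            # Best predecessor for this candidate among frame t-1 candidates.
            options = [best[j] + smoothness * box_iou(prev_box, box)
                       for j, (prev_box, _) in enumerate(frames[t - 1])]
            j_star = max(range(len(options)), key=options.__getitem__)
            new_best.append(options[j_star] + score)
            pointers.append(j_star)
        best = new_best
        backpointers.append(pointers)
    # Trace the best path back from the last frame.
    index = max(range(len(best)), key=best.__getitem__)
    path = [index]
    for pointers in reversed(backpointers):
        index = pointers[index]
        path.append(index)
    return path[::-1]


# Hypothetical candidates for the subject of a query such as (person, ride, bike).
subject_frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.1)],
    [((1, 0, 11, 10), 0.4), ((50, 50, 60, 60), 0.5)],
    [((2, 0, 12, 10), 0.8), ((50, 50, 60, 60), 0.2)],
]
smooth_path = link_regions(subject_frames, smoothness=1.0)
greedy_path = link_regions(subject_frames, smoothness=0.0)
```

With the overlap term enabled, the linked trajectory stays on the slowly moving box even in the middle frame where a distractor scores higher; with `smoothness=0.0` the selection degenerates to a per-frame argmax and jumps to the distractor.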

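This paper introduces a new task, visual relation grounding in videos, which aims to localize given relations in subject-predicate-object form in videos so as to support other high-level video-language tasks (e.g., video-language grounding and video question answering). To tackle the challenges, two sequences of regions are collaboratively optimized over a constructed region graph through relation attending and reconstruction, and a message passing mechanism based on spatial attention shifting between visual entities is further proposed. The model not only significantly outperforms baseline approaches but also produces visually meaningful facts to support visual grounding.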