The correlation between vision and text is essential for video moment
retrieval (VMR); however, existing methods rely heavily on separately
pre-trained feature extractors for visual and textual understanding. Without
sufficient temporal boundary annotations, it is non-trivial to learn universal
video-text alignments. In this work, we explore multi-modal correlations
derived from large-scale image-text data to facilitate generalisable VMR. To
address the limitations of image-text pre-training models in capturing
video changes, we propose a generic method, referred to as Visual-Dynamic
Injection (VDI), to empower the model's understanding of video moments. Whilst
existing VMR methods focus on building temporal-aware video features,
awareness of the text descriptions of temporal changes is also
critical, yet it is overlooked in pre-training that matches static images
with sentences. Therefore, we extract visual context and spatial dynamic
information from video frames and explicitly enforce their alignments with the
phrases describing video changes (e.g. verbs). By doing so, the potentially
relevant visual and motion patterns in videos are encoded (injected) in the
corresponding text embeddings so as to enable more accurate video-text
alignments. We conduct extensive experiments on two VMR benchmark datasets
(Charades-STA and ActivityNet-Captions) and achieve state-of-the-art
performance. In particular, VDI yields notable advantages when tested on the
out-of-distribution splits, where the testing samples involve novel scenes and
vocabulary.
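
The alignment idea described above can be illustrated with a toy example. The sketch below is not the VDI implementation: the abstract does not specify how dynamics are extracted or which loss is used, so it assumes (hypothetically) that averaged frame-to-frame feature differences stand in for visual dynamics and that an InfoNCE-style contrastive loss enforces their alignment with verb-phrase embeddings. All function names and the 2-D toy features are made up for illustration.

```python
import math


def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dynamic_feature(frames):
    """Assumed proxy for visual dynamics: mean frame-to-frame feature difference."""
    diffs = [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]
    return mean_vec(diffs)


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(visual, phrases, temperature=0.1):
    """InfoNCE-style loss: visual[i] should match phrases[i] against all other phrases."""
    total = 0.0
    for i, v in enumerate(visual):
        logits = [cosine(v, p) / temperature for p in phrases]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denom - logits[i]
    return total / len(visual)


# Toy per-frame features for two clips (hypothetical values).
clip_a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # changes along dimension 0
clip_b = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]  # changes along dimension 1
dynamics = [dynamic_feature(clip_a), dynamic_feature(clip_b)]

# Toy verb-phrase embeddings, one per clip (hypothetical values).
verb_phrases = [[1.0, 0.0], [0.0, 1.0]]

aligned = info_nce(dynamics, verb_phrases)
misaligned = info_nce(dynamics, list(reversed(verb_phrases)))
print(aligned < misaligned)
```

Minimising such a loss pulls each verb-phrase embedding towards the motion pattern of its own clip and away from the others, which is one plausible reading of "injecting" visual dynamics into text embeddings.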

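This study explores multi-modal correlations in large-scale image-text data and proposes a generic method, Visual-Dynamic Injection (VDI), to strengthen the model's understanding of video moments and its extraction of visual-dynamic information, enabling more accurate video-text alignment. The method achieves notable improvements over existing VMR methods.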

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

As a core task in location-based services (LBS) (e.g., navigation maps),
query and point of interest (POI) matching connects users' intent with
real-world geographic information. Recently, pre-trained models (PTMs) have
advanced many natural language processing (NLP) tasks. Yet generic
text-based PTMs lack sufficient geographic knowledge for query-POI matching.
To overcome this limitation, related work employs
domain-adaptive pre-training on geo-related corpora. However, a query
generally contains mentions of multiple geographic objects, such as nearby
roads and regions of interest (ROIs). The geographic context (GC), i.e., these
diverse geographic objects and their relationships, is therefore pivotal to
retrieving the most relevant POI. Single-modal PTMs can barely make use of the
important GC and therefore have limited performance. In this work, we propose a
novel query-POI matching method, the Multi-modal Geographic language model
(MGeo), which comprises a geographic encoder and a multi-modal interaction
module. MGeo represents GC as a new modality and is able to fully extract
multi-modal correlations for accurate query-POI matching. Moreover, there is no
publicly available benchmark for this topic. To facilitate further research, we
build a new open-source large-scale benchmark, Geographic TExtual Similarity
(GeoTES). The POIs come from an open-source geographic information system
(GIS). The queries are manually generated by annotators to avoid privacy
issues. Extensive experimental results against several strong baselines
and detailed ablation analyses on GeoTES demonstrate that our proposed
multi-modal pre-training method can significantly improve the query-POI
matching capability of generic PTMs, even when the queries' GC is not provided.
Our code and dataset are publicly available at
this https URL.
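
To make the notion of geographic context as an extra modality concrete, here is a deliberately simple sketch. It is not MGeo: the abstract names a geographic encoder and a multi-modal interaction module without describing them, so this example assumes (hypothetically) an inverse-distance encoding of nearby objects by type, plain concatenation as the interaction step, and cosine similarity for ranking. The type list, function names, and feature values are all invented for illustration.

```python
import math

# Hypothetical vocabulary of geographic object types.
GEO_TYPES = ["road", "roi", "poi"]


def encode_gc(objects):
    """Toy geographic encoder: per-type inverse-distance weights.

    `objects` is a list of (type, distance_in_metres) pairs near the location.
    """
    vec = [0.0] * len(GEO_TYPES)
    for kind, dist in objects:
        vec[GEO_TYPES.index(kind)] += 1.0 / (1.0 + dist / 100.0)
    return vec


def fuse(text_vec, gc_vec):
    """Toy interaction step: concatenate the text and GC vectors."""
    return list(text_vec) + list(gc_vec)


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def rank_pois(query_vec, pois):
    """Return POI names sorted by descending similarity to the fused query vector."""
    scored = [(cosine(query_vec, vec), name) for name, vec in pois.items()]
    return [name for _, name in sorted(scored, key=lambda s: s[0], reverse=True)]


# Both candidate POIs have the same (toy) text embedding, so text alone cannot
# separate them; only the geographic context differs.
text = [1.0, 0.0]
pois = {
    "A": fuse(text, encode_gc([("road", 50.0)])),  # next to a road
    "B": fuse(text, encode_gc([("roi", 50.0)])),   # inside a region of interest
}
query = fuse(text, encode_gc([("road", 50.0)]))    # the query mentions a nearby road

ranked = rank_pois(query, pois)
print(ranked)
```

In this toy setting the geographic part of the vector breaks a tie that the text part cannot, which is the intuition behind treating GC as its own modality rather than relying on a single-modal text model.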

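This study proposes a novel Multi-modal Geographic language model (MGeo) for query-POI matching. By treating geographic information as a new modality, it extracts multi-modal correlations while accurately representing the multiple geographic objects in a query, improving the query-POI matching capability of generic PTMs.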