Aligning a user query with video clips in a cross-modal latent space and
aligning them through semantic concepts are the two mainstream approaches to
ad-hoc video search (AVS). However, the effectiveness of existing approaches is
bottlenecked by the small size of available video-text datasets and the low
quality of concept banks, which lead to failures on unseen queries and to the
out-of-vocabulary problem. This paper addresses these two problems by
constructing a new dataset and developing a multi-word concept bank.
Specifically, capitalizing on a generative model, we construct a new dataset
consisting of 7 million generated text and video pairs for pre-training. To
tackle the out-of-vocabulary problem, we develop a multi-word concept bank
based on syntax analysis to enhance the capability of a state-of-the-art
interpretable AVS method in modeling relationships between query words. We also
study the impact of current advanced features on the method. Experimental
results show that integrating the proposed elements doubles the R@1
performance of the AVS method on the MSRVTT dataset and improves xinfAP on the
TRECVid AVS query sets for 2016-2023 (eight years) by margins ranging from 2% to
77%, with an average of about 20%.
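
For readers unfamiliar with the R@1 figure quoted above, the following is a minimal sketch of how recall at rank k is commonly computed for text-to-video retrieval. It assumes a square similarity matrix in which video i is the single correct match for query i; the function name `recall_at_k` and the toy scores are illustrative and are not taken from the paper.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth video ranks in the top k.

    Assumption: similarity[i][j] scores query i against video j, and
    video i is the only correct match for query i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy example with made-up scores for three queries and three videos.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: video 2 outranks the correct video 1
    [0.1, 0.2, 0.7],  # query 2: correct video 2 is ranked first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

Under this definition, "doubling R@1" means that twice as many queries retrieve their correct video at the top rank.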

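By constructing a new dataset and developing a multi-word concept bank, this paper addresses the bottlenecks that existing methods face with unseen queries and the out-of-vocabulary problem. Experimental results show that integrating these elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by 2% to 77%, about 20% on average.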

Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank

Scaling up weakly-supervised datasets has been shown to be highly effective in the
image-text domain and has contributed to most of the recent state-of-the-art
computer vision and multimodal neural networks. However, existing large-scale
video-text datasets and mining techniques suffer from several limitations, such
as the scarcity of aligned data, the lack of diversity in the data, and the
difficulty of collecting aligned data. The currently popular approach to mining
video-text data, automatic speech recognition (ASR) as used in HowTo100M,
provides low-quality captions that often do not refer to the video content.
Other mining approaches either do not provide proper language descriptions
(video tags) or are biased toward short clips (alt text). In this work, we show
how recent advances
in image captioning allow us to pre-train high-quality video models without any
parallel video-text data. We pre-train several video captioning models that are
based on an OPT language model and a TimeSformer visual backbone. We fine-tune
these networks on several video captioning datasets. First, we demonstrate that
image captioning pseudolabels work better for pre-training than the existing
HowTo100M ASR captions. Second, we show that pre-training on both images and
videos produces a significantly better network (+4 CIDEr on MSR-VTT) than
pre-training on a single modality. Our methods are complementary to the
existing pre-training or data mining approaches and can be used in a variety of
settings. Given the efficacy of the pseudolabeling method, we are planning to
publicly release the generated captions.
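
The pseudolabeling idea can be illustrated with a small sketch: sample a few frames from an unlabeled video and caption each one with an image captioning model. The abstract does not say how frames are selected or which captioner is used, so the evenly spaced sampling, the function names `sample_frame_indices` and `pseudolabel_video`, and the stand-in `toy_captioner` below are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced indices, one at the centre of each equal-length segment."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


def pseudolabel_video(
    frames: Sequence[Frame],
    caption_image: Callable[[Frame], str],
    num_samples: int = 1,
) -> List[Tuple[int, str]]:
    """Caption sampled frames; each (frame index, caption) pair is a pseudolabel."""
    indices = sample_frame_indices(len(frames), num_samples)
    return [(idx, caption_image(frames[idx])) for idx in indices]


# Hypothetical stand-in for a real image captioning model.
def toy_captioner(frame: str) -> str:
    return f"a frame showing {frame}"


# Frames are plain strings here; a real pipeline would pass decoded images.
video = ["cat", "dog", "bird", "fish"]
labels = pseudolabel_video(video, toy_captioner, num_samples=2)
```

Passing the captioner as a callable keeps the sketch independent of any particular model; the resulting captions would stand in for ASR transcripts as the text side of the pre-training pairs.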

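This paper presents a method for pre-training high-quality video models using image captions. It shows that pre-training on image-caption pseudolabels is more effective than pre-training on automatic speech recognition captions, that pre-training on images and videos together yields significantly better network performance than pre-training on a single modality, and that the method is complementary to existing pre-training and data mining approaches.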