Although pre-training vision-language models on large-scale web videos has
brought significant gains in video-text retrieval performance, fine-tuning
still plays a critical role, and it relies on manually annotated clips with
start and end times, which require considerable human effort. To address
this issue, we explore a cheaper alternative source of annotations, single
timestamps, for video-text retrieval. We initialise clips from timestamps in a
heuristic way to warm up a retrieval model. Then a video clip editing method is
proposed to refine the initial rough boundaries to improve retrieval
performance. A student-teacher network is introduced for video clip editing.
The teacher model is employed to edit the clips in the training set whereas the
student model trains on the edited clips. The teacher's weights are updated from
the student's once the student's performance improves. Our method is
model-agnostic and applicable to any retrieval model. We conduct experiments based
on three state-of-the-art retrieval models, COOT, VideoCLIP and CLIP4Clip.
Experiments conducted on three video retrieval datasets, YouCook2, DiDeMo and
ActivityNet-Captions, show that our edited clips consistently improve retrieval
performance over initial clips across all three retrieval models.
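
The abstract does not specify the initialisation heuristic or how the student's improvement is measured, so the following is only a minimal sketch under stated assumptions: a fixed-width window centred on the timestamp (the `half_width` parameter is hypothetical) and a single scalar validation score that gates the teacher update. The names `Clip`, `init_clip` and `maybe_update_teacher` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start: float  # seconds
    end: float    # seconds


def init_clip(timestamp, video_duration, half_width=2.0):
    """Assumed heuristic: a fixed-width window centred on the timestamp,
    clamped to the video's extent. The paper's actual heuristic may differ."""
    start = max(0.0, timestamp - half_width)
    end = min(video_duration, timestamp + half_width)
    return Clip(start, end)


def maybe_update_teacher(teacher_weights, student_weights,
                         student_score, best_score):
    """Copy the student's weights into the teacher only when the student's
    score (assumed: a scalar validation metric) beats the best seen so far.
    Returns the (possibly updated) teacher weights and the best score."""
    if student_score > best_score:
        return dict(student_weights), student_score
    return teacher_weights, best_score
```

In a full training loop, the teacher would use its current weights to edit the clip boundaries in the training set, the student would train on those edited clips, and `maybe_update_teacher` would be called after each evaluation round.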

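Using single timestamps as a cheap source of annotations, this work proposes a video-text retrieval method in which initial clip boundaries are derived from the timestamps and then refined by a video clip editing method to improve retrieval performance. Experiments show that editing the clips consistently improves retrieval performance.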

Video Editing for Video Retrieval

Web-crawled datasets are pivotal to the success of pre-training
vision-language models, exemplified by CLIP. However, web-crawled AltTexts can
be noisy and potentially irrelevant to images, thereby undermining the crucial
image-text alignment. Existing methods for rewriting captions using large
language models (LLMs) have shown promise on small, curated datasets like CC3M
and CC12M. Nevertheless, their efficacy on massive web-captured captions is
constrained by the inherent noise and randomness in such data. In this study,
we address this limitation by focusing on two key aspects: data quality and
data variety. Unlike recent LLM rewriting techniques, we emphasize exploiting
visual concepts and their integration into the captions to improve data
quality. For data variety, we propose a novel mixed training scheme that
optimally leverages AltTexts alongside newly generated Visual-enriched Captions
(VeC). We use CLIP as one example and adapt the method for CLIP training on
large-scale web-crawled datasets, named VeCLIP. We conduct a comprehensive
evaluation of VeCLIP across small, medium, and large scales of raw data. Our
results show significant advantages in image-text alignment and overall model
performance, underscoring the effectiveness of VeCLIP in improving CLIP
training. For example, VeCLIP improves COCO and Flickr30k retrieval by more
than 20% under the 12M setting. For data efficiency, we also achieve a gain of
more than 3% while using only 14% of the data employed in vanilla CLIP and 11%
of the data employed in ALIGN.
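
The abstract does not describe how the mixed training scheme combines the two caption sources. One simple reading, offered here only as a hedged sketch, is to choose per sample between the original AltText and its Visual-enriched Caption at random. The mixing probability `p_vec` and the function names are assumptions for illustration, not details from the paper.

```python
import random


def pick_caption(alt_text, vec_caption, rng, p_vec=0.5):
    """Return the VeC caption with probability p_vec, else the AltText.
    p_vec is a hypothetical mixing ratio, not a value from the paper."""
    return vec_caption if rng.random() < p_vec else alt_text


def mixed_batch(pairs, rng, p_vec=0.5):
    """Build the text side of a training batch from (AltText, VeC) pairs."""
    return [pick_caption(alt, vec, rng, p_vec) for alt, vec in pairs]


# Usage: a seeded generator keeps the sampling reproducible.
rng = random.Random(0)
pairs = [("img_001.jpg", "a brown dog running on a beach"),
         ("sale now on", "a red jacket displayed on a mannequin")]
captions = mixed_batch(pairs, rng)
```

Sampling per example keeps both caption styles in every epoch, which matches the stated goal of data variety without discarding the original AltTexts.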

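This work focuses on improving data quality and data variety, with particular emphasis on integrating visual concepts into captions, and proposes VeCLIP, a new method for training on web-crawled datasets. A comprehensive evaluation of data efficiency and model performance demonstrates VeCLIP's significant advantages in image-text alignment and overall model performance.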