We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR
requires systems to understand both videos and their associated subtitle
(dialogue) texts, making it more realistic. The dataset contains 109K queries
collected on 21.8K videos from 6 TV shows of diverse genres, where each query
is associated with a tight temporal window. The queries are also labeled with
query types that indicate whether each of them is more related to video or
subtitle or both, allowing for in-depth analysis of the dataset and the methods
built on top of it. Strict qualification and post-annotation verification
tests are applied to ensure the quality of the collected data. Further, we
present several baselines and a novel Cross-modal Moment Localization (XML)
network for multimodal moment retrieval tasks. The proposed XML model uses a
late fusion design with a novel Convolutional Start-End detector (ConvSE),
surpassing baselines by a large margin and with better efficiency, providing a
strong starting point for future work. We have also collected additional
descriptions for each annotated moment in TVR to form a new multimodal
captioning dataset with 262K captions, named TV show Caption (TVC). Both
datasets are publicly available.
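
As a rough illustration of the start-end detection idea behind ConvSE, the
sketch below treats moment localization as edge detection on a 1D curve of
per-clip relevance scores. Everything in it is an assumption made for
illustration: the input being a list of query-clip similarity scores, the
hand-set filter weights (a trained detector would learn them), the additive
span score, and the names `conv1d_same` and `detect_span`. It is not the
authors' implementation.

```python
import numpy as np


def conv1d_same(signal, kernel):
    """Cross-correlate a 1D signal with a kernel, zero-padded to keep its length."""
    pad = len(kernel) // 2
    padded = np.pad(signal, pad)  # zero padding on both sides
    k = len(kernel)
    return np.array([np.dot(padded[i:i + k], kernel) for i in range(len(signal))])


def detect_span(scores, max_len=None):
    """Return (start, end) clip indices of the highest-scoring span, end inclusive."""
    scores = np.asarray(scores, dtype=float)

    # Hypothetical hand-set edge filters standing in for learned ones.
    start_kernel = np.array([-1.0, 1.0, 0.0])  # rising edge: score[i] - score[i-1]
    end_kernel = np.array([0.0, 1.0, -1.0])    # falling edge: score[i] - score[i+1]

    start_scores = conv1d_same(scores, start_kernel)
    end_scores = conv1d_same(scores, end_kernel)

    best_score, best_span = float("-inf"), (0, 0)
    n = len(scores)
    for s in range(n):
        for e in range(s, n):  # enforce start <= end
            if max_len is not None and e - s + 1 > max_len:
                break
            total = start_scores[s] + end_scores[e]
            if total > best_score:
                best_score, best_span = total, (s, e)
    return best_span


# Toy similarity curve: clips 2..4 match the query, the rest do not.
toy_scores = [0.1, 0.1, 0.9, 0.8, 0.9, 0.1]
print(detect_span(toy_scores))  # (2, 4)
```

Scoring starts and ends separately and combining them afterwards keeps the
search over candidate spans cheap, since no per-span features have to be
computed.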

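This work introduces TV show Retrieval (TVR), a new multimodal retrieval dataset that combines videos with their associated subtitle text. It contains 109K queries, each associated with a precise temporal window and labeled with a query type indicating whether the query relates to the video or to the subtitles. We also propose a novel Cross-modal Moment Localization (XML) network for multimodal moment retrieval tasks; the model uses a novel Convolutional Start-End detector (ConvSE) and achieves better efficiency and performance. In addition, we collected descriptions of each annotated moment in TVR to form a new multimodal captioning dataset, TVC. Both datasets are publicly available.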