In this paper we tackle the cross-modal video retrieval problem and, more specifically, focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that give rise to multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations, our proposed network architecture is trained following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how best to combine textual and visual features and document the performance of the proposed network. Source code is publicly available at: https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
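
To make the idea of revising query-video similarities with a softmax more concrete, the following is a minimal sketch of one plausible form of such a revision. It is an illustrative assumption and not necessarily the operation used in the proposed network: each raw similarity is reweighted by a softmax taken over all queries for the same video, so that a video that scores highly for many queries is down-weighted. The function names `softmax`, `revise_similarities`, and `rank_videos`, the `temperature` parameter, and the toy similarity values are all hypothetical.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def revise_similarities(sim, temperature=1.0):
    """Reweight sim[q][v] by a softmax over queries for each video v.

    Assumption for illustration only: the revised score is the raw
    similarity multiplied by the query-wise softmax weight.
    """
    n_queries = len(sim)
    n_videos = len(sim[0])
    revised = [[0.0] * n_videos for _ in range(n_queries)]
    for v in range(n_videos):
        column = [sim[q][v] for q in range(n_queries)]
        weights = softmax(column, temperature)
        for q in range(n_queries):
            revised[q][v] = sim[q][v] * weights[q]
    return revised


def rank_videos(scores):
    """Return video indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)


# Toy similarities (rows: queries, columns: videos); values are made up.
sim = [
    [0.70, 0.75],
    [0.10, 0.95],
]
revised = revise_similarities(sim, temperature=0.1)
raw_ranking_q0 = rank_videos(sim[0])
revised_ranking_q0 = rank_videos(revised[0])
revised_ranking_q1 = rank_videos(revised[1])
```

In this toy case, video 1 scores highly for both queries. The raw scores rank it first for query 0, whereas after the revision query 0 prefers video 0 and video 1 remains the top result for query 1, which matches it more strongly.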

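This paper addresses the cross-modal video retrieval problem, focusing specifically on text-to-video retrieval, and examines how best to combine multiple diverse textual and visual features to generate multiple joint feature spaces. The network architecture is trained with a multiple space learning procedure, additional softmax operations are introduced to revise the inferred query-video similarities, and experiments on three large-scale datasets document the performance of the proposed network.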

Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval