This paper addresses cross-modal video retrieval, focusing specifically on
text-to-video retrieval. We investigate how to optimally combine multiple,
diverse textual and visual features into feature pairs, each of which yields a
joint feature space that encodes text-video pairs into comparable
representations. To learn these representations, the proposed network
architecture is trained with a multiple space learning procedure. At the
retrieval stage, we additionally introduce softmax operations that revise the
inferred query-video similarities. Extensive experiments in several setups on
three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on
how best to combine textual and visual features and document the performance
of the proposed network. Source code is made publicly available at:
this https URL
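
The retrieval-stage idea in the abstract above (several joint spaces, each
producing query-video similarities, followed by a softmax-based revision) can
be illustrated with a minimal sketch. Everything here is an assumption made
for illustration and not the authors' implementation: the abstract does not
say how per-space similarities are fused (summation is assumed), nor over
which axes the softmax is taken (a "dual softmax" over videos and over queries
is assumed). The function names, the random embeddings, and the choice of four
spaces are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_similarity_matrix(queries, videos):
    # Rows are L2-normalised so the dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    v = videos / np.linalg.norm(videos, axis=1, keepdims=True)
    return q @ v.T  # shape: (num_queries, num_videos)

def combine_spaces(space_similarities):
    # Assumed fusion rule: sum the similarity matrices of all joint spaces.
    return np.sum(space_similarities, axis=0)

def revise_with_softmax(sim):
    # Assumed "dual softmax": one softmax over videos (per query) multiplied
    # by one softmax over queries (per video).
    return softmax(sim, axis=1) * softmax(sim, axis=0)

# Hypothetical setup: 3 queries, 5 videos, 4 joint spaces of dimension 8.
# Random vectors stand in for embeddings already projected into each space.
rng = np.random.default_rng(0)
sims = [
    cosine_similarity_matrix(rng.normal(size=(3, 8)), rng.normal(size=(5, 8)))
    for _ in range(4)
]
fused = combine_spaces(sims)
revised = revise_with_softmax(fused)
ranking = np.argsort(-revised, axis=1)  # best-matching video index first
```

Under these assumptions, a video that scores highly for many queries has its
score damped for each of them, so the revision favours videos that match one
query distinctively.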

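This paper addresses cross-modal video retrieval, focusing specifically on text-to-video retrieval, and explores how best to combine multiple diverse textual and visual features to generate multiple joint feature spaces. The network architecture is trained with a multiple space learning procedure, additional softmax operations are introduced to revise the inferred query-video similarities, and experiments on three large-scale datasets document the performance of the proposed network.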

Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

Multi-modal dialog modeling is of growing interest. In this work, we propose
frameworks for a specific case of multi-modal dialog generation that better
mimics real-world conversation, where each dialog turn is associated with the
visual context in which it takes place. Specifically, we propose to model the
mutual dependency between textual and visual features: the model must learn
not only the probability of generating the next dialog utterance given the
preceding utterances and visual contexts, but also the probability of
predicting the visual features in which a dialog utterance takes place, which
makes the generated utterance specific to its visual context. We observe
significant performance boosts over vanilla models when the mutual dependency
between text and visual features is modeled.
Code is available at this https URL
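
One way to read the mutual dependency described above is as a two-term
training objective. The notation below is an assumption introduced for
illustration: u_t is the utterance at turn t, v_t is the visual feature of
that turn, and \lambda is a hypothetical weight balancing the two terms. The
abstract does not state how the two probabilities are combined, so this is a
sketch rather than the paper's actual loss.

```latex
\mathcal{L}
  = -\log p\left(u_t \mid u_{<t},\, v_{\le t}\right)
    \;-\; \lambda \,\log p\left(v_t \mid u_t\right)
```

The first term is the usual generation likelihood given the preceding
utterances and visual contexts. The second term rewards utterances from which
the visual context can be predicted, which is what would tie the generated
utterance to its visual context.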

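This paper proposes an approach to multi-modal dialog generation that better mimics multi-modal dialog in real-world settings. By modeling the mutual dependency between textual and visual features and studying the generation of dialog tied to its visual context, it substantially improves model performance.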