Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate to distances. We also adopt the embedding approach, and make the following contributions: First, we utilize web image search in sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over the state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is comparable to the current state-of-the-art, although our embeddings were trained for the retrieval tasks.

该研究旨在基于自然语言查询进行视频检索，并采用嵌入模型进行检索任务的训练，试图通过图像搜索以及嵌入模型的应用使 fine-grained 视觉概念得到消歧，最终在视频和句子检索任务中实现了明显的改进，并取得了与当前最先进技术相媲美的描述生成性能。

使用网络图像搜索学习视频和句子的联合表示